Database Internals

A Deep Dive into How Distributed Data Systems Work

Alex Petrov

Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

  • Acquisitions Editor: Mike Loukides
  • Development Editor: Michele Cronin
  • Production Editor: Christopher Faucher
  • Copyeditor: Kim Cofer
  • Proofreader: Sonia Saruba
  • Indexer: Judith McConville
  • Interior Designer: David Futato
  • Cover Designer: Karen Montgomery
  • Illustrator: Rebecca Demarest
  • October 2019: First Edition

Revision History for the First Edition

  • 2019-09-12: First Release

See http://oreilly.com/catalog/errata.csp?isbn=9781492040347 for release details.

Dedication

To Pieter Hintjens, from whom I got my first ever signed book:

an inspiring distributed systems programmer, author, philosopher, and friend.

Preface

Distributed database systems are an integral part of most businesses and the vast majority of software applications. These applications provide logic and a user interface, while database systems take care of data integrity, consistency, and redundancy.

Back in 2000, if you were to choose a database, you would have just a few options, and most of them would be within the realm of relational databases, so differences between them would be relatively small. Of course, this does not mean that all databases were completely the same, but their functionality and use cases were very similar.

Some of these databases have focused on horizontal scaling (scaling out)—improving performance and increasing capacity by running multiple database instances acting as a single logical unit: Gamma Database Machine Project, Teradata, Greenplum, Parallel DB2, and many others. Today, horizontal scaling remains one of the most important properties that customers expect from databases. This can be explained by the rising popularity of cloud-based services. It is often easier to spin up a new instance and add it to the cluster than scaling vertically (scaling up) by moving the database to a larger, more powerful machine. Migrations can be long and painful, potentially incurring downtime.

Around 2010, a new class of eventually consistent databases started appearing, and terms such as NoSQL, and later, big data grew in popularity. Over the last 15 years, the open source community, large internet companies, and database vendors have created so many databases and tools that it’s easy to get lost trying to understand use cases, details, and specifics.

The Dynamo paper [DECANDIA07], published by the team at Amazon in 2007, had so much impact on the database community that within a short period it inspired many variants and implementations. The most prominent of them were Apache Cassandra, created at Facebook; Project Voldemort, created at LinkedIn; and Riak, created by former Akamai engineers.

Today, the field is changing again: after the time of key-value stores, NoSQL, and eventual consistency, we have started seeing more scalable and performant databases, able to execute complex queries with stronger consistency guarantees.

Audience of This Book

In conversations at technical conferences, I often hear the same question: “How can I learn more about database internals? I don’t even know where to start.” Most of the books on database systems do not go into details of storage engine implementation, and cover the access methods, such as B-Trees, on a rather high level. There are very few books that cover more recent concepts, such as different B-Tree variants and log-structured storage, so I usually recommend reading papers.

Everyone who reads papers knows that it’s not that easy: you often lack context, the wording might be ambiguous, there’s little or no connection between papers, and they’re hard to find. This book contains concise summaries of important database systems concepts and can serve as a guide for those who’d like to dig in deeper, or as a cheat sheet for those already familiar with these concepts.

Not everyone wants to become a database developer, but this book will help people who build software that uses database systems: software developers, reliability engineers, architects, and engineering managers.

If your company depends on any infrastructure component, be it a database, a messaging queue, a container platform, or a task scheduler, you have to read the project change-logs and mailing lists to stay in touch with the community and be up-to-date with the most recent happenings in the project. Understanding terminology and knowing what’s inside will enable you to yield more information from these sources and use your tools more productively to troubleshoot, identify, and avoid potential risks and bottlenecks. Having an overview and a general understanding of how database systems work will help in case something goes wrong. Using this knowledge, you’ll be able to form a hypothesis, validate it, find the root cause, and present it to other project maintainers.

This book is also for curious minds: for the people who like learning things without immediate necessity, those who spend their free time hacking on something fun, creating compilers, writing homegrown operating systems, text editors, computer games, learning programming languages, and absorbing new information.

The reader is assumed to have some experience with developing backend systems and working with database systems as a user. Having some prior knowledge of different data structures will help to digest material faster.

Why Should I Read This Book?

We often hear people describing database systems in terms of the concepts and algorithms they implement: “This database uses gossip for membership propagation” (see Chapter 12), “They have implemented Dynamo,” or “This is just like what they’ve described in the Spanner paper” (see Chapter 13). Or, if you’re discussing the algorithms and data structures, you can hear something like “ZAB and Raft have a lot in common” (see Chapter 14), “Bw-Trees are like the B-Trees implemented on top of log structured storage” (see Chapter 6), or “They are using sibling pointers like in Blink-Trees” (see Chapter 5).

We need abstractions to discuss complex concepts, and we can’t have a discussion about terminology every time we start a conversation. Having shortcuts in the form of common language helps us to move our attention to other, higher-level problems.

One of the advantages of learning the fundamental concepts, proofs, and algorithms is that they never grow old. Of course, there will always be new ones, but new algorithms are often created after finding a flaw or room for improvement in a classical one. Knowing the history helps to understand differences and motivation better.

Learning about these things is inspiring. You see the variety of algorithms, see how our industry was solving one problem after the other, and get to appreciate that work. At the same time, learning is rewarding: you can almost feel how multiple puzzle pieces move together in your mind to form a full picture that you will always be able to share with others.

Scope of This Book

This is neither a book about relational database management systems nor about NoSQL ones, but about the algorithms and concepts used in all kinds of database systems, with a focus on the storage engine and the components responsible for distribution.

Some concepts, such as query planning, query optimization, scheduling, the relational model, and a few others, are already covered in several great textbooks on database systems. Some of these concepts are usually described from the user’s perspective, but this book concentrates on the internals. You can find some pointers to useful literature in the Part II Conclusion and in the chapter summaries. In these books you’re likely to find answers to many database-related questions you might have.

Query languages aren’t discussed, since there’s no single common language among the database systems mentioned in this book.

To collect material for this book, I studied over 15 books, more than 300 papers, countless blog posts, source code, and the documentation for several open source databases. The rule of thumb for whether or not to include a particular concept in the book was the question: “Do the people in the database industry and research circles talk about this concept?” If the answer was “yes,” I added the concept to the long list of things to discuss.

Structure of This Book

There are some examples of extensible databases with pluggable components (such as [SCHWARZ86]), but they are rather rare. At the same time, there are plenty of examples where databases use pluggable storage. Similarly, we rarely hear database vendors talking about query execution, while they are very eager to discuss the ways their databases preserve consistency.

The most significant distinctions between database systems are concentrated around two aspects: how they store and how they distribute the data. (Other subsystems can at times also be of importance, but are not covered here.) The book is arranged into parts that discuss the subsystems and components responsible for storage (Part I) and distribution (Part II).

Part I discusses node-local processes and focuses on the storage engine, the central component of the database system and one of the most significant distinctive factors. First, we start with the architecture of a database management system and present several ways to classify database systems based on the primary storage medium and layout.

We continue with storage structures and try to understand how disk-based structures are different from in-memory ones, introduce B-Trees, and cover algorithms for efficiently maintaining B-Tree structures on disk, including serialization, page layout, and on-disk representations. Later, we discuss multiple variants to illustrate the power of this concept and the diversity of data structures influenced and inspired by B-Trees.

Last, we discuss several variants of log-structured storage, commonly used for implementing file and storage systems, along with the motivation and reasons to use them.

Part II is about how to organize multiple nodes into a database cluster. We start with the importance of understanding the theoretical concepts for building fault-tolerant distributed systems, how distributed systems are different from single-node applications, and which problems, constraints, and complications we face in a distributed environment.

After that, we dive deep into distributed algorithms. Here, we start with algorithms for failure detection, helping to improve performance and stability by noticing and reporting failures and avoiding the failed nodes. Since many algorithms discussed later in the book rely on understanding the concept of leadership, we introduce several algorithms for leader election and discuss their suitability.

As one of the most difficult things in distributed systems is achieving data consistency, we discuss concepts of replication, followed by consistency models, possible divergence between replicas, and eventual consistency. Since eventually consistent systems sometimes rely on anti-entropy for convergence and gossip for data dissemination, we discuss several anti-entropy and gossip approaches. Finally, we discuss logical consistency in the context of database transactions, and finish with consensus algorithms.

It would’ve been impossible to write this book without all the research and publications. You will find many references to papers and publications in the text, in square brackets with monospace font; for example, [DECANDIA07]. You can use these references to learn more about related concepts in more detail.

After each chapter, you will find a summary section that contains material for further study, related to the content of the chapter.

Conventions Used in This Book

The following typographical conventions are used in this book:

Italic

Indicates new terms, URLs, email addresses, filenames, and file extensions.

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.

Tip

This element signifies a tip or suggestion.

Note

This element signifies a general note.

Warning

This element indicates a warning or caution.

Using Code Examples

This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing a CD-ROM of examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Database Internals by Alex Petrov (O’Reilly). Copyright 2019 Oleksandr Petrov, 978-1-492-04034-7.”

If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at .

O’Reilly Online Learning

Note

For almost 40 years, O’Reilly Media has provided technology and business training, knowledge, and insight to help companies succeed.

Our unique network of experts and innovators share their knowledge and expertise through books, articles, conferences, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, please visit http://oreilly.com.

How to Contact Us

Please address comments and questions concerning this book to the publisher:

  • O’Reilly Media, Inc.
  • 1005 Gravenstein Highway North
  • Sebastopol, CA 95472
  • 800-998-9938 (in the United States or Canada)
  • 707-829-0515 (international or local)
  • 707-829-0104 (fax)

We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at http://bit.ly/database-internals.

To comment or ask technical questions about this book, please send an email to .

For more information about our books, courses, conferences, and news, see our website at http://www.oreilly.com.

Find us on Facebook: http://facebook.com/oreilly

Follow us on Twitter: http://twitter.com/oreillymedia

Watch us on YouTube: http://www.youtube.com/oreillymedia

Acknowledgments

This book wouldn’t have been possible without the hundreds of people who have worked hard on research papers and books, which have been a source of ideas, inspiration, and served as references for this book.

I’d like to say thank you to all the people who reviewed manuscripts and provided feedback, making sure that the material in this book is correct and the wording is precise: Dmitry Alimov, Peter Alvaro, Carlos Baquero, Jason Brown, Blake Eggleston, Marcus Eriksson, Francisco Fernández Castaño, Heidi Howard, Vaidehi Joshi, Maximilian Karasz, Stas Kelvich, Michael Klishin, Predrag Knežević, Joel Knighton, Eugene Lazin, Nate McCall, Christopher Meiklejohn, Tyler Neely, Maxim Neverov, Marina Petrova, Stefan Podkowinski, Edward Ribeiro, Denis Rytsov, Kir Shatrov, Alex Sorokoumov, Massimiliano Tomassi, and Ariel Weisberg.

Of course, this book wouldn’t have been possible without support from my family: my wife Marina and my daughter Alexandra, who have supported me every step of the way.

Part I. Storage Engines

The primary job of any database management system is reliably storing data and making it available for users. We use databases as a primary source of data, helping us to share it between the different parts of our applications. Instead of finding a way to store and retrieve information and inventing a new way to organize data every time we create a new app, we use databases. This way we can concentrate on application logic instead of infrastructure.

Since the term database management system (DBMS) is quite bulky, throughout this book we use more compact terms, database system and database, to refer to the same concept.

Databases are modular systems and consist of multiple parts: a transport layer accepting requests, a query processor determining the most efficient way to run queries, an execution engine carrying out the operations, and a storage engine (see “DBMS Architecture”).

The storage engine (or database engine) is a software component of a database management system responsible for storing, retrieving, and managing data in memory and on disk, designed to capture a persistent, long-term memory of each node [REED78]. While databases can respond to complex queries, storage engines look at the data more granularly and offer a simple data manipulation API, allowing users to create, update, delete, and retrieve records. One way to look at this is that database management systems are applications built on top of storage engines, offering a schema, a query language, indexing, transactions, and many other useful features.
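To make the shape of that simple data manipulation API concrete, here is a toy sketch. It is purely illustrative: the class and method names are invented for this example, and a dict stands in for what a real engine would persist to disk with concurrency control and recovery.

```python
# A minimal sketch of a storage engine's record-level API. All names
# here are hypothetical; a real engine stores data on disk and handles
# concurrency, durability, and recovery.

class SimpleEngine:
    def __init__(self):
        self._records = {}  # key bytes -> value bytes

    def create(self, key: bytes, value: bytes) -> None:
        if key in self._records:
            raise KeyError("record already exists")
        self._records[key] = value

    def update(self, key: bytes, value: bytes) -> None:
        if key not in self._records:
            raise KeyError("no such record")
        self._records[key] = value

    def retrieve(self, key: bytes) -> bytes:
        return self._records[key]

    def delete(self, key: bytes) -> None:
        del self._records[key]
```

Everything a DBMS layers on top of this — schema, query language, indexing, transactions — ultimately bottoms out in calls like these.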

For flexibility, both keys and values can be arbitrary sequences of bytes with no prescribed form. Their sorting and representation semantics are defined in higher-level subsystems. For example, you can use int32 (32-bit integer) as a key in one of the tables, and ascii (ASCII string) in the other; from the storage engine perspective both keys are just serialized entries.
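As an illustration (not taken from any particular system), here is how an int32 key and an ASCII key might be serialized by a higher-level subsystem before reaching the storage engine, which sees only opaque byte sequences:

```python
import struct

# Two keys serialized by higher-level subsystems. From the storage
# engine's perspective both are just bytes; the comparison and
# representation semantics live in the layers above.
int32_key = struct.pack(">i", 42)      # big-endian 32-bit integer
ascii_key = "user:42".encode("ascii")  # ASCII string

print(int32_key)  # b'\x00\x00\x00*'
print(ascii_key)  # b'user:42'
```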

Storage engines such as BerkeleyDB, LevelDB and its descendant RocksDB, LMDB and its descendant libmdbx, Sophia, HaloDB, and many others were developed independently from the database management systems they’re now embedded into. Using pluggable storage engines has enabled database developers to bootstrap database systems using existing storage engines, and concentrate on the other subsystems.

At the same time, clear separation between database system components opens up an opportunity to switch between different engines, potentially better suited for particular use cases. For example, MySQL, a popular database management system, has several storage engines, including InnoDB, MyISAM, and RocksDB (in the MyRocks distribution). MongoDB allows switching between WiredTiger, In-Memory, and the (now-deprecated) MMAPv1 storage engines.
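A hypothetical sketch of what this separation looks like in code: the database core depends only on a narrow engine interface, so concrete engines can be swapped. None of these names correspond to a real system’s API.

```python
from abc import ABC, abstractmethod

# Illustrative pluggable-engine interface. The database core codes
# against StorageEngine; any conforming engine can be plugged in.

class StorageEngine(ABC):
    @abstractmethod
    def put(self, key: bytes, value: bytes) -> None: ...

    @abstractmethod
    def get(self, key: bytes) -> bytes: ...

class InMemoryEngine(StorageEngine):
    """One possible engine; an on-disk B-Tree or LSM engine could
    implement the same interface."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]

class Database:
    """Higher-level subsystems (schema, queries, transactions) would
    live here, unaware of which engine is underneath."""
    def __init__(self, engine: StorageEngine):
        self.engine = engine
```

Swapping engines then amounts to constructing `Database` with a different `StorageEngine` implementation, which is the design idea behind MySQL’s multiple engines and MongoDB’s switchable storage engines.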

Part I. Comparing Databases

Your choice of database system may have long-term consequences. If there’s a chance that a database is not a good fit because of performance problems, consistency issues, or operational challenges, it is better to find out about it earlier in the development cycle, since it can be nontrivial to migrate to a different system. In some cases, it may require substantial changes in the application code.

Every database system has strengths and weaknesses. To reduce the risk of an expensive migration, you can invest some time before you decide on a specific database to build confidence in its ability to meet your application’s needs.

Trying to compare databases based on their components (e.g., which storage engine they use, how the data is shared, replicated, and distributed, etc.), their rank (an arbitrary popularity value assigned by consultancy agencies such as ThoughtWorks or database comparison websites such as DB-Engines or Database of Databases), or implementation language (C++, Java, or Go, etc.) can lead to invalid and premature conclusions. These methods can be used only for a high-level comparison and can be as coarse as choosing between HBase and SQLite, so even a superficial understanding of how each database works and what’s inside it can help you reach a more informed conclusion.

Every comparison should start by clearly defining the goal, because even the slightest bias may completely invalidate the entire investigation. If you’re searching for a database that would be a good fit for the workloads you have (or are planning to facilitate), the best thing you can do is to simulate these workloads against different database systems, measure the performance metrics that are important for you, and compare results. Some issues, especially when it comes to performance and scalability, start showing only after some time or as the capacity grows. To detect potential problems, it is best to have long-running tests in an environment that simulates the real-world production setup as closely as possible.

Simulating real-world workloads not only helps you understand how the database performs, but also helps you learn how to operate, debug, and find out how friendly and helpful its community is. Database choice is always a combination of these factors, and performance often turns out not to be the most important aspect: it’s usually much better to use a database that slowly saves the data than one that quickly loses it.

To compare databases, it’s helpful to understand the use case in great detail and define the current and anticipated variables, such as:

  • Schema and record sizes

  • Number of clients

  • Types of queries and access patterns

  • Rates of the read and write queries

  • Expected changes in any of these variables

Knowing these variables can help to answer the following questions:

  • Does the database support the required queries?

  • Is this database able to handle the amount of data we’re planning to store?

  • How many read and write operations can a single node handle?

  • How many nodes should the system have?

  • How do we expand the cluster given the expected growth rate?

  • What is the maintenance process?
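As a rough illustration of how these variables feed the sizing questions, a back-of-the-envelope calculation might look like the following. Every number here is hypothetical; real estimates must come from your own measurements.

```python
# Hypothetical capacity estimate: size the cluster by whichever
# constraint (storage or write throughput) demands more nodes.
record_size_bytes = 1_000
total_records = 2_000_000_000
replication_factor = 3
per_node_storage_bytes = 500 * 10**9   # usable storage per node

raw_bytes = record_size_bytes * total_records
nodes_for_storage = -(-raw_bytes * replication_factor // per_node_storage_bytes)  # ceiling division

writes_per_second = 300_000
per_node_write_capacity = 25_000       # measured, not assumed, in practice
nodes_for_writes = -(-writes_per_second // per_node_write_capacity)

nodes = max(nodes_for_storage, nodes_for_writes)
print(nodes)  # 12 for these made-up numbers
```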

Having these questions answered, you can construct a test cluster and simulate your workloads. Most databases already have stress tools that can be used to reconstruct specific use cases. If there’s no standard stress tool to generate realistic randomized workloads in the database ecosystem, it might be a red flag. If something prevents you from using default tools, you can try one of the existing general-purpose tools, or implement one from scratch.
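The core of such a stress tool can be sketched as a randomized workload generator. The following toy is illustrative only: the 80/20 read/write mix and the key-space size are invented parameters, and a real tool would also execute the operations against the database and record latencies.

```python
import random

# Toy randomized workload generator: a seeded mix of reads and writes
# against random keys, the kind of stream a stress tool replays.

def generate_workload(num_ops, key_space=10_000, read_ratio=0.8, seed=42):
    rng = random.Random(seed)  # fixed seed makes runs reproducible
    ops = []
    for _ in range(num_ops):
        key = f"user{rng.randrange(key_space)}"
        if rng.random() < read_ratio:
            ops.append(("read", key))
        else:
            ops.append(("write", key, rng.randbytes(16)))
    return ops

workload = generate_workload(1_000)
reads = sum(1 for op in workload if op[0] == "read")
print(reads)  # roughly 800 of the 1,000 operations
```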

If the tests show positive results, it may be helpful to familiarize yourself with the database code. Looking at the code, it is often useful to first understand the parts of the database, how to find the code for different components, and then navigate through those. Having even a rough idea about the database codebase helps you better understand the log records it produces, its configuration parameters, and helps you find issues in the application that uses it and even in the database code itself.

It’d be great if we could use databases as black boxes and never have to take a look inside them, but the practice shows that sooner or later a bug, an outage, a performance regression, or some other problem pops up, and it’s better to be prepared for it. If you know and understand database internals, you can reduce business risks and improve chances for a quick recovery.

One of the popular tools used for benchmarking, performance evaluation, and comparison is Yahoo! Cloud Serving Benchmark (YCSB). YCSB offers a framework and a common set of workloads that can be applied to different data stores. Just like anything generic, this tool should be used with caution, since it’s easy to make wrong conclusions. To make a fair comparison and make an educated decision, it is necessary to invest enough time to understand the real-world conditions under which the database has to perform, and tailor benchmarks accordingly.

This doesn’t mean that benchmarks can be used only to compare databases. Benchmarks can be useful for defining and testing details of the service-level agreement,1 understanding system requirements, capacity planning, and more. The more knowledge you have about the database before using it, the more time you’ll save when running it in production.

Choosing a database is a long-term decision, and it’s best to keep track of newly released versions, understand what exactly has changed and why, and have an upgrade strategy. New releases usually contain improvements and fixes for bugs and security issues, but may introduce new bugs, performance regressions, or unexpected behavior, so testing new versions before rolling them out is also critical. Checking how database implementers were handling upgrades previously might give you a good idea about what to expect in the future. Past smooth upgrades do not guarantee that future ones will be as smooth, but complicated upgrades in the past might be a sign that future ones won’t be easy, either.

Part I. Understanding Trade-Offs

As users, we can see how databases behave under different conditions, but when working on databases, we have to make choices that influence this behavior directly.

Designing a storage engine is definitely more complicated than just implementing a textbook data structure: there are many details and edge cases that are hard to get right from the start. We need to design the physical data layout and organize pointers, decide on the serialization format, understand how data is going to be garbage-collected, how the storage engine fits into the semantics of the database system as a whole, figure out how to make it work in a concurrent environment, and, finally, make sure we never lose any data, under any circumstances.

Not only are there many things to decide upon, but most of these decisions involve trade-offs. For example, if we save records in the order they were inserted into the database, we can store them more quickly, but if we retrieve them in their lexicographical order, we have to re-sort them before returning results to the client. As you will see throughout this book, there are many different approaches to storage engine design, and every implementation has its own upsides and downsides.
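This particular trade-off can be sketched in a few lines (illustrative Python, not any engine’s actual code): appending is cheap but an ordered scan pays a sort, while keeping records ordered on insert makes the scan free at the price of costlier writes.

```python
import bisect

# Append-only store: O(1) amortized writes, O(n log n) ordered reads.
log = []

def insert(record):
    log.append(record)            # write in arrival order

def ordered_scan():
    return sorted(log)            # re-sort before returning results

# Keeping records ordered on insert flips the trade-off.
sorted_store = []

def insert_ordered(record):
    bisect.insort(sorted_store, record)   # O(n) shifting per write

def ordered_scan_fast():
    return list(sorted_store)             # already in key order
```

Neither variant is “better”; which one wins depends on the ratio of writes to ordered reads in the workload.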

When looking at different storage engines, we discuss their benefits and shortcomings. If there were an absolutely optimal storage engine for every conceivable use case, everyone would just use it. But since it does not exist, we need to choose wisely, based on the workloads and use cases we’re trying to facilitate.

There are many storage engines, using all sorts of data structures, implemented in different languages, ranging from low-level ones, such as C, to high-level ones, such as Java. All storage engines face the same challenges and constraints. To draw a parallel with city planning, it is possible to build a city for a specific population and choose to build up or build out. In both cases, the same number of people will fit into the city, but these approaches lead to radically different lifestyles. When building the city up, people live in apartments and population density is likely to lead to more traffic in a smaller area; in a more spread-out city, people are more likely to live in houses, but commuting will require covering larger distances.

Similarly, design decisions made by storage engine developers make them better suited for different things: some are optimized for low read or write latency, some try to maximize density (the amount of stored data per node), and some concentrate on operational simplicity.

You can find complete algorithms that can be used for the implementation and other additional references in the chapter summaries. Reading this book should make you well equipped to work productively with these sources and give you a solid understanding of the existing alternatives to concepts described there.

1 The service-level agreement (or SLA) is a commitment by the service provider about the quality of provided services. Among other things, the SLA can include information about latency, throughput, jitter, and the number and frequency of failures.

Chapter 1. Introduction and Overview

Database management systems can serve different purposes: some are used primarily for temporary hot data, some serve as a long-lived cold storage, some allow complex analytical queries, some only allow accessing values by the key, some are optimized to store time-series data, and some store large blobs efficiently. To understand differences and draw distinctions, we start with a short classification and overview, as this helps us to understand the scope of further discussions.

Terminology can sometimes be ambiguous and hard to understand without a complete context. For example, distinctions between column and wide column stores that have little or nothing to do with each other, or how clustered and nonclustered indexes relate to index-organized tables. This chapter aims to disambiguate these terms and find their precise definitions.

We start with an overview of database management system architecture (see “DBMS Architecture”), and discuss system components and their responsibilities. After that, we discuss the distinctions among the database management systems in terms of a storage medium (see “Memory- Versus Disk-Based DBMS”), and layout (see “Column- Versus Row-Oriented DBMS”).

These two groups do not present a full taxonomy of database management systems and there are many other ways they’re classified. For example, some sources group DBMSs into three major categories:

Online transaction processing (OLTP) databases

These handle a large number of user-facing requests and transactions. Queries are often predefined and short-lived.

Online analytical processing (OLAP) databases

These handle complex aggregations. OLAP databases are often used for analytics and data warehousing, and are capable of handling complex, long-running ad hoc queries.

Hybrid transactional and analytical processing (HTAP)

These databases combine properties of both OLTP and OLAP stores.

There are many other terms and classifications: key-value stores, relational databases, document-oriented stores, and graph databases. These concepts are not defined here, since the reader is assumed to have a high-level knowledge and understanding of their functionality. Because the concepts we discuss here are widely applicable and are used in most of the mentioned types of stores in some capacity, a complete taxonomy is not necessary or important for further discussion.

Since Part I of this book focuses on the storage and indexing structures, we need to understand the high-level data organization approaches, and the relationship between the data and index files (see “Data Files and Index Files”).

Finally, in “Buffering, Immutability, and Ordering”, we discuss three techniques widely used to develop efficient storage structures and how applying these techniques influences their design and implementation.

DBMS Architecture

There’s no common blueprint for database management system design. Every database is built slightly differently, and component boundaries are somewhat hard to see and define. Even if these boundaries exist on paper (e.g., in project documentation), in code seemingly independent components may be coupled because of performance optimizations, handling edge cases, or architectural decisions.

Sources that describe database management system architecture (for example, [HELLERSTEIN07], [WEIKUM01], [ELMASRI11], and [MOLINA08]), define components and relationships between them differently. The architecture presented in Figure 1-1 demonstrates some of the common themes in these representations.

Database management systems use a client/server model, where database system instances (nodes) take the role of servers, and application instances take the role of clients.

Client requests arrive through the transport subsystem. Requests come in the form of queries, most often expressed in some query language. The transport subsystem is also responsible for communication with other nodes in the database cluster.

Figure 1-1. Architecture of a database management system

Upon receipt, the transport subsystem hands the query over to a query processor, which parses, interprets, and validates it. Later, access control checks are performed, as they can be done fully only after the query is interpreted.

The parsed query is passed to the query optimizer, which first eliminates impossible and redundant parts of the query, and then attempts to find the most efficient way to execute it based on internal statistics (index cardinality, approximate intersection size, etc.) and data placement (which nodes in the cluster hold the data and the costs associated with its transfer). The optimizer handles both relational operations required for query resolution, usually presented as a dependency tree, and optimizations, such as index ordering, cardinality estimation, and choosing access methods.

The query is usually presented in the form of an execution plan (or query plan): a sequence of operations that have to be carried out for its results to be considered complete. Since the same query can be satisfied using different execution plans that can vary in efficiency, the optimizer picks the best available plan.
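As a toy illustration of a plan as a sequence of operations, consider a pipeline of generator-based operators. The operator names (`scan`, `filter_op`, `project`) are invented for this sketch and do not belong to any particular engine:

```python
def scan(table):
    # Leaf of the plan: produce every row of the table.
    yield from table

def filter_op(rows, predicate):
    # Keep only the rows matching the predicate.
    return (row for row in rows if predicate(row))

def project(rows, columns):
    # Keep only the requested columns of each row.
    return ({col: row[col] for col in columns} for row in rows)

users = [
    {"id": 10, "name": "John"},
    {"id": 20, "name": "Sam"},
    {"id": 30, "name": "Keith"},
]

# One possible plan for: SELECT name FROM users WHERE id > 10
plan = project(filter_op(scan(users), lambda row: row["id"] > 10), ["name"])
result = list(plan)
```

An optimizer’s job, in these terms, is to pick the cheapest among several operator trees that produce the same `result`.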

The execution plan is handled by the execution engine, which collects the results of the execution of local and remote operations. Remote execution can involve writing and reading data to and from other nodes in the cluster, and replication.

Local queries (coming directly from clients or from other nodes) are executed by the storage engine. The storage engine has several components with dedicated responsibilities:

Transaction manager

This manager schedules transactions and ensures they cannot leave the database in a logically inconsistent state.

Lock manager

This manager locks on the database objects for the running transactions, ensuring that concurrent operations do not violate physical data integrity.

Access methods (storage structures)

These manage access and organize data on disk. Access methods include heap files and storage structures such as B-Trees (see “Ubiquitous B-Trees”) or LSM Trees (see “LSM Trees”).

Buffer manager

This manager caches data pages in memory (see “Buffer Management”).

Recovery manager

This manager maintains the operation log and restores the system state in case of a failure (see “Recovery”).

Together, transaction and lock managers are responsible for concurrency control (see “Concurrency Control”): they guarantee logical and physical data integrity while ensuring that concurrent operations are executed as efficiently as possible.

Memory- Versus Disk-Based DBMS

Database systems store data in memory and on disk. In-memory database management systems (sometimes called main memory DBMS) store data primarily in memory and use the disk for recovery and logging. Disk-based DBMS hold most of the data on disk and use memory for caching disk contents or as a temporary storage. Both types of systems use the disk to a certain extent, but main memory databases store their contents almost exclusively in RAM.

Accessing memory has been and remains several orders of magnitude faster than accessing disk,1 so it is compelling to use memory as the primary storage, and it becomes more economically feasible to do so as memory prices go down. However, RAM prices still remain high compared to persistent storage devices such as SSDs and HDDs.

Main memory database systems are different from their disk-based counterparts not only in terms of a primary storage medium, but also in which data structures, organization, and optimization techniques they use.

Databases using memory as a primary data store do this mainly because of performance, comparatively low access costs, and access granularity. Programming for main memory is also significantly simpler than doing so for the disk. Operating systems abstract memory management and allow us to think in terms of allocating and freeing arbitrarily sized memory chunks. On disk, we have to manage data references, serialization formats, freed memory, and fragmentation manually.

The main limiting factors on the growth of in-memory databases are RAM volatility (in other words, lack of durability) and costs. Since RAM contents are not persistent, software errors, crashes, hardware failures, and power outages can result in data loss. There are ways to ensure durability, such as uninterrupted power supplies and battery-backed RAM, but they require additional hardware resources and operational expertise. In practice, it all comes down to the fact that disks are easier to maintain and have significantly lower prices.

The situation is likely to change as the availability and popularity of Non-Volatile Memory (NVM) [ARULRAJ17] technologies grow. NVM storage reduces or completely eliminates (depending on the exact technology) asymmetry between read and write latencies, further improves read and write performance, and allows byte-addressable access.

Durability in Memory-Based Stores

In-memory database systems maintain backups on disk to provide durability and prevent loss of the volatile data. Some databases store data exclusively in memory, without any durability guarantees, but we do not discuss them in the scope of this book.

Before the operation can be considered complete, its results have to be written to a sequential log file. We discuss write-ahead logs in more detail in “Recovery”. To avoid replaying complete log contents during startup or after a crash, in-memory stores maintain a backup copy. The backup copy is maintained as a sorted disk-based structure, and modifications to this structure are often asynchronous (decoupled from client requests) and applied in batches to reduce the number of I/O operations. During recovery, database contents can be restored from the backup and logs.

Log records are usually applied to backup in batches. After the batch of log records is processed, backup holds a database snapshot for a specific point in time, and log contents up to this point can be discarded. This process is called checkpointing. It reduces recovery times by keeping the disk-resident database most up-to-date with log entries without requiring clients to block until the backup is updated.
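A minimal sketch of this log-plus-checkpoint scheme is shown below. It is illustrative only: a real engine batches and fsyncs log writes, keeps the backup as a sorted disk-based structure, and applies log records asynchronously.

```python
class InMemoryStore:
    """Toy in-memory store with a write-ahead log and a batched backup."""

    def __init__(self):
        self.memory = {}   # primary, volatile storage
        self.wal = []      # sequential log, conceptually on disk
        self.backup = {}   # disk-based snapshot, updated in batches

    def put(self, key, value):
        self.wal.append((key, value))   # log first...
        self.memory[key] = value        # ...then apply in memory

    def checkpoint(self):
        # Apply the batch of log records to the backup; the log prefix
        # covered by the snapshot can then be discarded.
        for key, value in self.wal:
            self.backup[key] = value
        self.wal.clear()

    def recover(self):
        # Rebuild memory from the snapshot plus the remaining log tail.
        self.memory = dict(self.backup)
        for key, value in self.wal:
            self.memory[key] = value
```

Checkpointing here is exactly what shortens recovery: only the log records written after the last `checkpoint()` need to be replayed.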

NOTE

It is unfair to say that the in-memory database is the equivalent of an on-disk database with a huge page cache (see “Buffer Management”). Even though pages are cached in memory, serialization format and data layout incur additional overhead and do not permit the same degree of optimization that in-memory stores can achieve.

Disk-based databases use specialized storage structures, optimized for disk access. In memory, pointers can be followed comparatively quickly, and random memory access is significantly faster than the random disk access. Disk-based storage structures often have a form of wide and short trees (see “Trees for Disk-Based Storage”), while memory-based implementations can choose from a larger pool of data structures and perform optimizations that would otherwise be impossible or difficult to implement on disk [MOLINA92]. Similarly, handling variable-size data on disk requires special attention, while in memory it’s often a matter of referencing the value with a pointer.

For some use cases, it is reasonable to assume that an entire dataset is going to fit in memory. Some datasets are bounded by their real-world representations, such as student records for schools, customer records for corporations, or inventory in an online store. Each record takes up not more than a few Kb, and their number is limited.

Column- Versus Row-Oriented DBMS

Most database systems store a set of data records, consisting of columns and rows in tables. Field is an intersection of a column and a row: a single value of some type. Fields belonging to the same column usually have the same data type. For example, if we define a table holding user records, all names would be of the same type and belong to the same column. A collection of values that belong logically to the same record (usually identified by the key) constitutes a row.

One of the ways to classify databases is by how the data is stored on disk: row- or column-wise. Tables can be partitioned either horizontally (storing values belonging to the same row together), or vertically (storing values belonging to the same column together). Figure 1-2 depicts this distinction: (a) shows the values partitioned column-wise, and (b) shows the values partitioned row-wise.
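The two partitionings can be illustrated with the same logical table (a sketch only; real layouts add page headers, slots, and compression on top of this idea):

```python
# Logical table: three user records.
records = [
    (10, "John", "01 Aug 1981"),
    (20, "Sam", "14 Sep 1988"),
    (30, "Keith", "07 Jan 1984"),
]

# (b) Horizontal partitioning: values of the same row stored together.
row_layout = [tuple(record) for record in records]

# (a) Vertical partitioning: values of the same column stored together.
columns = ["id", "name", "birth_date"]
column_layout = {
    name: [record[i] for record in records]
    for i, name in enumerate(columns)
}
```

Fetching one whole record is a single contiguous read in `row_layout`, while fetching all values of one column is a single contiguous read in `column_layout`; each layout makes the other access pattern more expensive.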

Figure 1-2. Data layout in column- and row-oriented stores

Examples of row-oriented database management systems are abundant: MySQL, PostgreSQL, and most of the traditional relational databases. The two pioneer open source column-oriented stores are MonetDB and C-Store (C-Store is an open source predecessor to Vertica).

Row-Oriented Data Layout

Row-oriented database management systems store data in records or rows. Their layout is quite close to the tabular data representation, where every row has the same set of fields. For example, a row-oriented database can efficiently store user entries, holding names, birth dates, and phone numbers:

| ID | Name  | Birth Date  | Phone Number   |
| 10 | John  | 01 Aug 1981 | +1 111 222 333 |
| 20 | Sam   | 14 Sep 1988 | +1 555 888 999 |
| 30 | Keith | 07 Jan 1984 | +1 333 444 555 |

This approach works well for cases where several fields constitute the record (name, birth date, and a phone number) uniquely identified by the key (in this example, a monotonically incremented number). All fields representing a single user record are often read together. When creating records (for example, when the user fills out a registration form), we write them together as well. At the same time, each field can be modified individually.

Since row-oriented stores are most useful in scenarios when we have to access data by row, storing entire rows together improves spatial locality2 [DENNING68].

Because data on a persistent medium such as a disk is typically accessed block-wise (in other words, a minimal unit of disk access is a block), a single block will contain data for all columns. This is great for cases when we’d like to access an entire user record, but makes queries accessing individual fields of multiple user records (for example, queries fetching only the phone numbers) more expensive, since data for the other fields will be paged in as well.

Column-Oriented Data Layout

Column-oriented database management systems partition data vertically (i.e., by column) instead of storing it in rows. Here, values for the same column are stored contiguously on disk (as opposed to storing rows contiguously as in the previous example). For example, if we store historical stock market prices, price quotes are stored together. Storing values for different columns in separate files or file segments allows efficient queries by column, since they can be read in one pass rather than consuming entire rows and discarding data for columns that weren’t queried.

Column-oriented stores are a good fit for analytical workloads that compute aggregates, such as finding trends, computing average values, etc. Processing complex aggregates can be used in cases when logical records have multiple fields, but some of them (in this case, price quotes) have different importance and are often consumed together.

From a logical perspective, the data representing stock market price quotes can still be expressed as a table:

| ID | Symbol | Date        | Price     |
| 1  | DOW    | 08 Aug 2018 | 24,314.65 |
| 2  | DOW    | 09 Aug 2018 | 24,136.16 |
| 3  | S&P    | 08 Aug 2018 | 2,414.45  |
| 4  | S&P    | 09 Aug 2018 | 2,232.32  |

However, the physical column-based database layout looks entirely different. Values belonging to the same column are stored closely together:

Symbol: 1:DOW; 2:DOW; 3:S&P; 4:S&P
Date:   1:08 Aug 2018; 2:09 Aug 2018; 3:08 Aug 2018; 4:09 Aug 2018
Price:  1:24,314.65; 2:24,136.16; 3:2,414.45; 4:2,232.32

To reconstruct data tuples, which might be useful for joins, filtering, and multirow aggregates, we need to preserve some metadata on the column level to identify which data points from other columns it is associated with. If you do this explicitly, each value will have to hold a key, which introduces duplication and increases the amount of stored data. Some column stores use implicit identifiers (virtual IDs) instead and use the position of the value (in other words, its offset) to map it back to the related values [ABADI13].
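A tiny sketch of virtual IDs, reusing the stock quote data above: each column is a separate array, and a value’s offset is enough to reconstruct the logical tuple without storing an explicit key per value.

```python
# Columns stored as separate arrays; a value's position (offset) acts as
# an implicit "virtual ID" linking it to values of the other columns.
symbol = ["DOW", "DOW", "S&P", "S&P"]
date = ["08 Aug 2018", "09 Aug 2018", "08 Aug 2018", "09 Aug 2018"]
price = [24314.65, 24136.16, 2414.45, 2232.32]

def reconstruct_tuple(offset):
    # Rebuild the logical row, e.g., for joins or multirow aggregates.
    return (symbol[offset], date[offset], price[offset])
```

Storing an explicit key next to every value would support the same reconstruction, but at the cost of the duplication the text describes.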

During the last several years, likely due to a rising demand to run complex analytical queries over growing datasets, we’ve seen many new column-oriented file formats such as Apache Parquet, Apache ORC, RCFile, as well as column-oriented stores, such as Apache Kudu, ClickHouse, and many others [ROY12].

Distinctions and Optimizations

It is not sufficient to say that distinctions between row and column stores are only in the way the data is stored. Choosing the data layout is just one of the steps in a series of possible optimizations that columnar stores are targeting.

Reading multiple values for the same column in one run significantly improves cache utilization and computational efficiency. On modern CPUs, vectorized instructions can be used to process multiple data points with a single CPU instruction3 [DREPPER07].

Storing values that have the same data type together (e.g., numbers with other numbers, strings with other strings) offers a better compression ratio. We can use different compression algorithms depending on the data type and pick the most effective compression method for each case.

To decide whether to use a column- or a row-oriented store, you need to understand your access patterns. If the read data is consumed in records (i.e., most or all of the columns are requested) and the workload consists mostly of point queries and range scans, the row-oriented approach is likely to yield better results. If scans span many rows, or compute an aggregate over a subset of columns, it is worth considering a column-oriented approach.

Wide Column Stores

Column-oriented databases should not be mixed up with wide column stores, such as BigTable or HBase, where data is represented as a multidimensional map, columns are grouped into column families (usually storing data of the same type), and inside each column family, data is stored row-wise. This layout is best for storing data retrieved by a key or a sequence of keys.

A canonical example from the Bigtable paper [CHANG06] is a Webtable. A Webtable stores snapshots of web page contents, their attributes, and the relations among them at a specific timestamp. Pages are identified by the reversed URL, and all attributes (such as page content and anchors, representing links between pages) are identified by the timestamps at which these snapshots were taken. In a simplified way, it can be represented as a nested map, as Figure 1-3 shows.

Figure 1-3. Conceptual structure of a Webtable

Data is stored in a multidimensional sorted map with hierarchical indexes: we can locate the data related to a specific web page by its reversed URL and its contents or anchors by the timestamp. Each row is indexed by its row key. Related columns are grouped together in column familiescontents and anchor in this example—which are stored on disk separately. Each column inside a column family is identified by the column key, which is a combination of the column family name and a qualifier (html, cnnsi.com, my.look.ca in this example). Column families store multiple versions of data by timestamp. This layout allows us to quickly locate the higher-level entries (web pages, in this case) and their parameters (versions of content and links to the other pages).

While it is useful to understand the conceptual representation of wide column stores, their physical layout is somewhat different. A schematic representation of the data layout in column families is shown in Figure 1-4: column families are stored separately, but in each column family, the data belonging to the same key is stored together.

Figure 1-4. Physical structure of a Webtable

Data Files and Index Files

The primary goal of a database system is to store data and to allow quick access to it. But how is the data organized? Why do we need a database management system and not just a bunch of files? How does file organization improve efficiency?

Database systems do use files for storing the data, but instead of relying on filesystem hierarchies of directories and files for locating records, they compose files using implementation-specific formats. The main reasons to use specialized file organization over flat files are:

Storage efficiency

Files are organized in a way that minimizes storage overhead per stored data record.

Access efficiency

Records can be located in the smallest possible number of steps.

Update efficiency

Record updates are performed in a way that minimizes the number of changes on disk.

Database systems store data records, consisting of multiple fields, in tables, where each table is usually represented as a separate file. Each record in the table can be looked up using a search key. To locate a record, database systems use indexes: auxiliary data structures that allow it to efficiently locate data records without scanning an entire table on every access. Indexes are built using a subset of fields identifying the record.

A database system usually separates data files and index files: data files store data records, while index files store record metadata and use it to locate records in data files. Index files are typically smaller than the data files. Files are partitioned into pages, which typically have the size of a single or multiple disk blocks. Pages can be organized as sequences of records or as slotted pages (see “Slotted Pages”).

New records (insertions) and updates to the existing records are represented by key/value pairs. Most modern storage systems do not delete data from pages explicitly. Instead, they use deletion markers (also called tombstones), which contain deletion metadata, such as a key and a timestamp. Space occupied by the records shadowed by their updates or deletion markers is reclaimed during garbage collection, which reads the pages, writes the live (i.e., nonshadowed) records to the new place, and discards the shadowed ones.
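
As a sketch (not any particular engine's on-disk format), garbage collection over a set of key/value records with tombstones can look like this: the latest entry per key shadows older ones, and keys whose latest entry is a tombstone are discarded:

```python
TOMBSTONE = object()  # deletion marker; real systems store a key and a timestamp

def garbage_collect(records):
    """records: iterable of (key, timestamp, value) in arbitrary order.
    Returns only the live (i.e., nonshadowed) records, one per key."""
    latest = {}
    for key, ts, value in records:
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, value)
    # Rewrite live records to the "new place", dropping tombstoned keys.
    return {k: v for k, (ts, v) in latest.items() if v is not TOMBSTONE}

page = [("a", 1, "v1"), ("a", 2, "v2"), ("b", 1, "x"), ("b", 3, TOMBSTONE)]
live = garbage_collect(page)  # "a" keeps its newest value; "b" was deleted
```
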

Data Files

Data files (sometimes called primary files) can be implemented as index-organized tables (IOT), heap-organized tables (heap files), or hash-organized tables (hashed files).

Records in heap files are not required to follow any particular order, and most of the time they are placed in write order. This way, no additional work or file reorganization is required when new pages are appended. Heap files require additional index structures, pointing to the locations where data records are stored, to make them searchable.

In hashed files, records are stored in buckets, and the hash value of the key determines which bucket a record belongs to. Records in the bucket can be stored in append order or sorted by key to improve lookup speed.
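
A minimal sketch of bucket placement in a hashed file; the bucket count and the choice of Python's built-in `hash` are arbitrary illustrative choices:

```python
N_BUCKETS = 8

def bucket_for(key):
    """The hash value of the key determines which bucket a record belongs to."""
    return hash(key) % N_BUCKETS

buckets = [[] for _ in range(N_BUCKETS)]

def insert(key, record):
    b = buckets[bucket_for(key)]
    b.append((key, record))
    b.sort()  # optional: keep the bucket sorted by key to improve lookup speed

def lookup(key):
    for k, record in buckets[bucket_for(key)]:
        if k == key:
            return record
    return None

insert("k1", "v1")
insert("k2", "v2")
```
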

Index-organized tables (IOTs) store data records in the index itself. Since records are stored in key order, range scans in IOTs can be implemented by sequentially scanning its contents.

Storing data records in the index allows us to reduce the number of disk seeks by at least one, since after traversing the index and locating the searched key, we do not have to address a separate file to find the associated data record.

When records are stored in a separate file, index files hold data entries, uniquely identifying data records and containing enough information to locate them in the data file. For example, we can store file offsets (sometimes called row locators), locations of data records in the data file, or bucket IDs in the case of hash files. In index-organized tables, data entries hold actual data records.
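
The relationship between a heap-organized data file and its index can be sketched as follows, with an in-memory byte buffer standing in for the on-disk file: the index maps search keys to byte offsets (row locators) of length-prefixed records:

```python
import io

data_file = io.BytesIO()          # stand-in for an on-disk heap file
index = {}                        # search key -> offset in the data file

def append_record(key, payload: bytes):
    """Heap files append records in write order; the index remembers where."""
    offset = data_file.seek(0, io.SEEK_END)
    data_file.write(len(payload).to_bytes(4, "big") + payload)
    index[key] = offset

def read_record(key) -> bytes:
    offset = index[key]           # one index lookup...
    data_file.seek(offset)
    size = int.from_bytes(data_file.read(4), "big")
    return data_file.read(size)   # ...then one read from the data file

append_record("a", b"hello")
append_record("b", b"world!")
```
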

Index Files

An index is a structure that organizes data records on disk in a way that facilitates efficient retrieval operations. Index files are organized as specialized structures that map keys to locations in data files where the records identified by these keys (in the case of heap files) or primary keys (in the case of index-organized tables) are stored.

An index on a primary (data) file is called the primary index. However, in most cases we can also assume that the primary index is built over a primary key or a set of keys identified as primary. All other indexes are called secondary.

Secondary indexes can point directly to the data record, or simply store its primary key. A pointer to a data record can hold an offset to a heap file or an index-organized table. Multiple secondary indexes can point to the same record, allowing a single data record to be identified by different fields and located through different indexes. While primary index files hold a unique entry per search key, secondary indexes may hold several entries per search key [MOLINA08].

If the order of data records follows the search key order, this index is called clustered (also known as clustering). Data records in the clustered case are usually stored in the same file or in a clustered file, where the key order is preserved. If the data is stored in a separate file, and its order does not follow the key order, the index is called nonclustered (sometimes called unclustered).

Figure 1-5 shows the difference between the two approaches:

  • a) Two indexes reference data entries directly from secondary index files.

  • b) A secondary index goes through the indirection layer of a primary index to locate the data entries.

Figure 1-5. Storing data records in an index file versus storing offsets to the data file (index segments shown in white; segments holding data records shown in gray)
Note

Index-organized tables store information in index order and are clustered by definition. Primary indexes are most often clustered. Secondary indexes are nonclustered by definition, since they’re used to facilitate access by keys other than the primary one. Clustered indexes can be both index-organized or have separate index and data files.

Many database systems have an inherent and explicit primary key, a set of columns that uniquely identify the database record. In cases when the primary key is not specified, the storage engine can create an implicit primary key (for example, MySQL InnoDB adds a new auto-increment column and fills in its values automatically).

This terminology is used in different kinds of database systems: relational database systems (such as MySQL and PostgreSQL), Dynamo-based NoSQL stores (such as Apache Cassandra and Riak), and document stores (such as MongoDB). There can be some project-specific naming, but most often there’s a clear mapping to this terminology.

Primary Index as an Indirection

There are different opinions in the database community on whether data records should be referenced directly (through file offset) or via the primary key index.4

Both approaches have their pros and cons and are better discussed in the scope of a complete implementation. By referencing data directly, we can reduce the number of disk seeks, but have to pay a cost of updating the pointers whenever the record is updated or relocated during a maintenance process. Using indirection in the form of a primary index allows us to reduce the cost of pointer updates, but has a higher cost on a read path.

Updating just a couple of indexes might work if the workload mostly consists of reads, but this approach does not work well for write-heavy workloads with multiple indexes. To reduce the costs of pointer updates, instead of payload offsets, some implementations use primary keys for indirection. For example, MySQL InnoDB uses a primary index and performs two lookups: one in the secondary index, and one in a primary index when performing a query [TARIQ11]. This adds an overhead of a primary index lookup instead of following the offset directly from the secondary index.
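
The two-lookup scheme described above can be sketched like this. It is a simplification of the approach, not InnoDB's actual code, and the table contents are invented for illustration:

```python
# Primary index: primary key -> data record (index-organized table).
primary_index = {
    10: {"id": 10, "email": "a@example.com"},
    20: {"id": 20, "email": "b@example.com"},
}

# Secondary index: search key -> primary key (not a file offset), so
# relocating a record never requires rewriting secondary index entries.
email_index = {"a@example.com": 10, "b@example.com": 20}

def find_by_email(email):
    pk = email_index[email]      # lookup 1: secondary index yields the key
    return primary_index[pk]     # lookup 2: primary index yields the record
```
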

Figure 1-6 shows how the two approaches are different:

  • a) Two indexes reference data entries directly from secondary index files.

  • b) A secondary index goes through the indirection layer of a primary index to locate the data entries.

Figure 1-6. Referencing data tuples directly (a) versus using a primary index as indirection (b)

It is also possible to use a hybrid approach and store both data file offsets and primary keys. First, you check if the data offset is still valid and pay the extra cost of going through the primary key index if it has changed, updating the index file after finding a new offset.
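
One way to sketch the hybrid scheme: each secondary index entry carries both a (possibly stale) offset and the primary key, and falls back to the primary index only when the cached offset no longer points at the right record. The data structures and names here are hypothetical:

```python
primary_index = {}   # primary key -> current offset
heap = {}            # offset -> (primary_key, record)

def read_hybrid(entry):
    """entry: (cached_offset, primary_key). Returns (record, fresh_entry)."""
    offset, pk = entry
    hit = heap.get(offset)
    if hit is not None and hit[0] == pk:      # cached offset is still valid
        return hit[1], entry
    offset = primary_index[pk]                # pay for the extra lookup...
    return heap[offset][1], (offset, pk)      # ...and repair the stale entry

# A record originally at offset 0 was relocated to offset 96:
heap[96] = ("user:1", "v2")
primary_index["user:1"] = 96
record, entry = read_hybrid((0, "user:1"))    # stale offset triggers fallback
```
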

Buffering, Immutability, and Ordering

A storage engine is based on some data structure. However, these structures do not describe the semantics of caching, recovery, transactionality, and other things that storage engines add on top of them.

In the next chapters, we will start the discussion with B-Trees (see “Ubiquitous B-Trees”) and try to understand why there are so many B-Tree variants, and why new database storage structures keep emerging.

Storage structures have three common variables: they use buffering (or avoid using it), use immutable (or mutable) files, and store values in order (or out of order). Most of the distinctions and optimizations in storage structures discussed in this book are related to one of these three concepts.

Buffering

This defines whether or not the storage structure chooses to collect a certain amount of data in memory before putting it on disk. Of course, every on-disk structure has to use buffering to some degree, since the smallest unit of data transfer to and from the disk is a block, and it is desirable to write full blocks. Here, we’re talking about avoidable buffering, something storage engine implementers choose to do. One of the first optimizations we discuss in this book is adding in-memory buffers to B-Tree nodes to amortize I/O costs (see “Lazy B-Trees”). However, this is not the only way we can apply buffering. For example, two-component LSM Trees (see “Two-component LSM Tree”), despite their similarities with B-Trees, use buffering in an entirely different way, and combine buffering with immutability.
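
The buffering idea can be sketched as a writer that accumulates records in memory and emits only full blocks; the block size and the in-memory list standing in for the disk are illustrative:

```python
BLOCK_SIZE = 4096  # flush in units of one "disk block"

class BufferedWriter:
    """Collects small writes in memory and emits full blocks, amortizing I/O."""
    def __init__(self):
        self.buffer = bytearray()
        self.flushed_blocks = []   # stand-in for blocks written to disk

    def write(self, payload: bytes):
        self.buffer += payload
        while len(self.buffer) >= BLOCK_SIZE:
            self.flushed_blocks.append(bytes(self.buffer[:BLOCK_SIZE]))
            del self.buffer[:BLOCK_SIZE]

w = BufferedWriter()
for _ in range(5):
    w.write(b"x" * 1000)   # 5,000 bytes of small writes become one full-block
                           # write, with the remainder held in the buffer
```
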

Mutability (or immutability)

This defines whether or not the storage structure reads parts of the file, updates them, and writes the updated results at the same location in the file. Immutable structures are append-only: once written, file contents are not modified. Instead, modifications are appended to the end of the file. There are other ways to implement immutability. One of them is copy-on-write (see “Copy-on-Write”), where the modified page, holding the updated version of the record, is written to the new location in the file, instead of its original location. Often the distinction between LSM and B-Trees is drawn as immutable against in-place update storage, but there are structures (for example, “Bw-Trees”) that are inspired by B-Trees but are immutable.

Ordering

This is defined as whether or not the data records are stored in the key order in the pages on disk. In other words, the keys that sort closely are stored in contiguous segments on disk. Ordering often defines whether or not we can efficiently scan the range of records, not only locate the individual data records. Storing data out of order (most often, in insertion order) opens up for some write-time optimizations. For example, Bitcask (see “Bitcask”) and WiscKey (see “WiscKey”) store data records directly in append-only files.
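
Why ordering matters for range scans can be sketched with a sorted key array: locating the start of the range is a binary search, and the rest of the scan reads a contiguous segment:

```python
import bisect

# Keys stored in sorted order, as in an ordered on-disk structure.
keys = ["apple", "banana", "cherry", "date", "fig", "grape"]

def range_scan(lo, hi):
    """Return all keys in [lo, hi]: one binary search, then a contiguous read."""
    start = bisect.bisect_left(keys, lo)
    end = bisect.bisect_right(keys, hi)
    return keys[start:end]
```

With unordered (insertion-order) storage, the same query would have to examine every record.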

Of course, a brief discussion of these three concepts is not enough to show their power, and we’ll continue this discussion throughout the rest of the book.

Summary

In this chapter, we’ve discussed the architecture of a database management system and covered its primary components.

To highlight the importance of disk-based structures and their difference from in-memory ones, we discussed memory- and disk-based stores. We came to the conclusion that disk-based structures are important for both types of stores, but are used for different purposes.

To understand how access patterns influence database system design, we discussed column- and row-oriented database management systems and the primary factors that set them apart from each other. To start a conversation about how the data is stored, we covered data and index files.

Lastly, we introduced three core concepts: buffering, immutability, and ordering. We will use them throughout this book to highlight properties of the storage engines that use them.

1 You can find a visualization and comparison of disk, memory access latencies, and many other relevant numbers over the years at https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html.

2 Spatial locality is one of the Principles of Locality, stating that if a memory location is accessed, its nearby memory locations will be accessed in the near future.

3 Vectorized instructions, or Single Instruction Multiple Data (SIMD), describes a class of CPU instructions that perform the same operation on multiple data points.

4 The original post that has stirred up the discussion was controversial and one-sided, but you can refer to the presentation comparing MySQL and PostgreSQL index and storage formats, which references the original source as well.

Chapter 2. B-Tree Basics

In the previous chapter, we separated storage structures in two groups: mutable and immutable ones, and identified immutability as one of the core concepts influencing their design and implementation. Most of the mutable storage structures use an in-place update mechanism. During insert, delete, or update operations, data records are updated directly in their locations in the target file.

Storage engines often allow multiple versions of the same data record to be present in the database; for example, when using multiversion concurrency control (see “Multiversion Concurrency Control”) or slotted page organization (see “Slotted Pages”). For the sake of simplicity, for now we assume that each key is associated only with one data record, which has a unique location.

One of the most popular storage structures is a B-Tree. Many open source database systems are B-Tree based, and over the years they’ve proven to cover the majority of use cases.

B-Trees are not a recent invention: they were introduced by Rudolph Bayer and Edward M. McCreight back in 1971 and gained popularity over the years. By 1979, there were already quite a few variants of B-Trees. Douglas Comer collected and systematized some of them [COMER79].

Before we dive into B-Trees, let’s first talk about why we should consider alternatives to traditional search trees, such as, for example, binary search trees, 2-3-Trees, and AVL Trees [KNUTH98]. For that, let’s recall what binary search trees are.

Binary Search Trees

A binary search tree (BST) is a sorted in-memory data structure, used for efficient key-value lookups. BSTs consist of multiple nodes. Each tree node is represented by a key, a value associated with this key, and two child pointers (hence the name binary). BSTs start from a single node, called a root node. There can be only one root in the tree. Figure 2-1 shows an example of a binary search tree.
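
A minimal BST with key-value insert and lookup, matching the description above (the keys and values are arbitrary examples):

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None    # two child pointers, hence "binary"

def insert(root, key, value):
    """Insert or update a key, returning the (possibly new) subtree root."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        root.left = insert(root.left, key, value)
    elif key > root.key:
        root.right = insert(root.right, key, value)
    else:
        root.value = value               # key already present: update in place
    return root

def search(root, key):
    """Walk down from the root, halving the search space at each node."""
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

root = None
for k in [4, 2, 6, 1, 3, 5, 7]:          # builds a balanced tree of height 3
    root = insert(root, k, f"value-{k}")
```
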

Figure 2-1. Binary search tree

Each node splits the search space into left and right subtrees, as Figure 2-2 shows: a node key is greater than any key stored in its left subtree and less than any key stored in its right subtree [SEDGEWICK11].

Figure 2-2. Binary tree node invariants

Following left pointers from the root of the tree down to the leaf level (the level where nodes have no children) locates the node holding the smallest key within the tree and a value associated with it. Similarly, following right pointers locates the node holding the largest key within the tree and a value associated with it. Values are allowed to be stored in all nodes in the tree. Searches start from the root node, and may terminate before reaching the bottom level of the tree if the searched key was found on a higher level.

Tree Balancing

Insert operations do not follow any specific pattern, and element insertion might lead to the situation where the tree is unbalanced (i.e., one of its branches is longer than the other one). The worst-case scenario is shown in Figure 2-3 (b), where we end up with a pathological tree, which looks more like a linked list, and instead of desired logarithmic complexity, we get linear, as illustrated in Figure 2-3 (a).

Figure 2-3. Balanced (a) and unbalanced or pathological (b) tree examples

This example might slightly exaggerate the problem, but it illustrates why the tree needs to be balanced: even though it’s somewhat unlikely that all the items end up on one side of the tree, at least some of them certainly will, which will significantly slow down searches.

The balanced tree is defined as one that has a height of log2 N, where N is the total number of items in the tree, and the difference in height between the two subtrees is not greater than one [KNUTH98]. Without balancing, we lose performance benefits of the binary search tree structure, and allow insertions and deletions order to determine tree shape.

In the balanced tree, following the left or right node pointer reduces the search space in half on average, so lookup complexity is logarithmic: O(log2 N). If the tree is not balanced, worst-case complexity goes up to O(N), since we might end up in the situation where all elements end up on one side of the tree.

Instead of adding new elements to one of the tree branches and making it longer, while the other one remains empty (as shown in Figure 2-3 (b)), the tree is balanced after each operation. Balancing is done by reorganizing nodes in a way that minimizes tree height and keeps the number of nodes on each side within bounds.

One of the ways to keep the tree balanced is to perform a rotation step after nodes are added or removed. If the insert operation leaves a branch unbalanced (two consecutive nodes in the branch have only one child), we can rotate nodes around the middle one. In the example shown in Figure 2-4, during rotation the middle node (3), known as a rotation pivot, is promoted one level higher, and its parent becomes its right child.
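
The rotation from Figure 2-4 can be sketched over plain binary tree nodes: a right rotation promotes the middle node (the pivot) one level, and its former parent becomes its right child:

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key, self.left, self.right = key, left, right

def rotate_right(node):
    """Promote node.left (the rotation pivot); node becomes its right child."""
    pivot = node.left
    node.left = pivot.right   # pivot's right subtree is re-attached to node
    pivot.right = node
    return pivot              # pivot is the new subtree root

# Unbalanced branch 5 -> 3 -> 2, rotated around the middle node (3):
unbalanced = Node(5, left=Node(3, left=Node(2)))
root = rotate_right(unbalanced)
```

After the rotation, the pathological chain of height 3 becomes a balanced subtree of height 2 with 3 at the root.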

Figure 2-4. Rotation step example

Trees for Disk-Based Storage

As previously mentioned, unbalanced trees have a worst-case complexity of O(N). Balanced trees give us an average O(log2 N). At the same time, due to low fanout (fanout is the maximum allowed number of children per node), we have to perform balancing, relocate nodes, and update pointers rather frequently. Increased maintenance costs make BSTs impractical as on-disk data structures [NIEVERGELT74].

If we wanted to maintain a BST on disk, we’d face several problems. One problem is locality: since elements are added in random order, there’s no guarantee that a newly created node is written close to its parent, which means that node child pointers may span across several disk pages. We can improve the situation to a certain extent by modifying the tree layout and using paged binary trees (see “Paged Binary Trees”).

Another problem, closely related to the cost of following child pointers, is tree height. Since binary trees have a fanout of just two, height is a binary logarithm of the number of the elements in the tree, and we have to perform O(log2 N) seeks to locate the searched element and, subsequently, perform the same number of disk transfers. 2-3-Trees and other low-fanout trees have a similar limitation: while they are useful as in-memory data structures, small node size makes them impractical for external storage [COMER79].

A naive on-disk BST implementation would require as many disk seeks as comparisons, since there’s no built-in concept of locality. This sets us on a course to look for a data structure that would exhibit this property.

Considering these factors, a version of the tree that would be better suited for disk implementation has to exhibit the following properties:

  • High fanout to improve locality of the neighboring keys.

  • Low height to reduce the number of seeks during traversal.

Note

Fanout and height are inversely correlated: the higher the fanout, the lower the height. If fanout is high, each node can hold more children, reducing the number of nodes and, subsequently, reducing height.
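
The inverse relationship can be checked numerically: the number of levels needed to address N items is the base-fanout logarithm of N. This is a back-of-the-envelope calculation that ignores node occupancy; the fanout of 100 is an illustrative value for a B-Tree node:

```python
import math

def height(n_items, fanout):
    """Levels needed so that fanout ** height >= n_items."""
    return math.ceil(math.log(n_items, fanout))

# For one billion items, a binary tree needs ~30 levels (and as many seeks),
# while a tree whose nodes hold 100 children needs only 5.
binary_height = height(10**9, 2)
btree_height = height(10**9, 100)
```
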

Disk-Based Structures

We’ve talked about memory and disk-based storage (see “Memory- Versus Disk-Based DBMS”) in general terms. We can draw the same distinction for specific data structures: some are better suited to be used on disk and some work better in memory.

As we have discussed, not every data structure that satisfies space and complexity requirements can be effectively used for on-disk storage. Data structures used in databases have to be adapted to account for persistent medium limitations.

On-disk data structures are often used when the amounts of data are so large that keeping an entire dataset in memory is impossible or not feasible. Only a fraction of the data can be cached in memory at any time, and the rest has to be stored on disk in a manner that allows efficiently accessing it.

Hard Disk Drives

Most traditional algorithms were developed when spinning disks were the most widespread persistent storage medium, which significantly influenced their design. Later, new developments in storage media, such as flash drives, inspired new algorithms and modifications to the existing ones, exploiting the capabilities of the new hardware. These days, new types of data structures are emerging, optimized to work with nonvolatile byte-addressable storage (for example, [XIA17] [KANNAN18]).

On spinning disks, seeks increase costs of random reads because they require disk rotation and mechanical head movements to position the read/write head to the desired location. However, once the expensive part is done, reading or writing contiguous bytes (i.e., sequential operations) is relatively cheap.

The smallest transfer unit of a spinning drive is a sector, so when some operation is performed, at least an entire sector can be read or written. Sector sizes typically range from 512 bytes to 4 Kb.

Head positioning is the most expensive part of an operation on the HDD. This is one of the reasons we often hear about the positive effects of sequential I/O: reading and writing contiguous memory segments from disk.

Solid State Drives

Solid state drives (SSDs) do not have moving parts: there’s no disk that spins, or head that has to be positioned for the read. A typical SSD is built of memory cells, connected into strings (typically 32 to 64 cells per string), strings are combined into arrays, arrays are combined into pages, and pages are combined into blocks [LARRIVEE15].

Depending on the exact technology used, a cell can hold one or multiple bits of data. Pages vary in size between devices, but typically their sizes range from 2 to 16 Kb. Blocks typically contain 64 to 512 pages. Blocks are organized into planes and, finally, planes are placed on a die. SSDs can have one or more dies. Figure 2-5 shows this hierarchy.

Figure 2-5. Schematic representation of SSD organization

The smallest unit that can be written (programmed) or read is a page. However, we can only make changes to the empty memory cells (i.e., to ones that have been erased before the write). The smallest erase entity is not a page, but a block that holds multiple pages, which is why it is often called an erase block. Pages in an empty block have to be written sequentially.

The part of a flash memory controller responsible for mapping page IDs to their physical locations, tracking empty, written, and discarded pages, is called the Flash Translation Layer (FTL) (see “Flash Translation Layer” for more about FTL). It is also responsible for garbage collection, during which FTL finds blocks it can safely erase. Some blocks might still contain live pages. In this case, it relocates live pages from these blocks to new locations and remaps page IDs to point there. After this, it erases the now-unused blocks, making them available for writes.

Since in both device types (HDDs and SSDs) we are addressing chunks of memory rather than individual bytes (i.e., accessing data block-wise), most operating systems have a block device abstraction [CESATI05]. It hides an internal disk structure and buffers I/O operations internally, so when we’re reading a single word from a block device, the whole block containing it is read. This is a constraint we cannot ignore and should always take into account when working with disk-resident data structures.

In SSDs, we don’t have a strong emphasis on random versus sequential I/O, as in HDDs, because the difference in latencies between random and sequential reads is not as large. There is still some difference caused by prefetching, reading contiguous pages, and internal parallelism [GOOSSAERT14].

Even though garbage collection is usually a background operation, its effects may negatively impact write performance, especially in cases of random and unaligned write workloads.

Writing only full blocks, and combining subsequent writes to the same block, can help to reduce the number of required I/O operations. We discuss buffering and immutability as ways to achieve that in later chapters.
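As an illustration (a toy buffer, not a real I/O scheduler), small writes can be accumulated in memory and handed to the device only as full, aligned blocks:

```python
BLOCK_SIZE = 4096


class BlockWriteBuffer:
    """Toy write buffer: accumulate small appends and emit only full,
    aligned blocks, so the device sees whole-block writes."""

    def __init__(self, flush_fn):
        self.buf = bytearray()
        self.flush_fn = flush_fn  # called once per complete block

    def append(self, data):
        self.buf.extend(data)
        while len(self.buf) >= BLOCK_SIZE:
            self.flush_fn(bytes(self.buf[:BLOCK_SIZE]))
            del self.buf[:BLOCK_SIZE]


flushed = []
w = BlockWriteBuffer(flushed.append)
for _ in range(5):
    w.append(b"x" * 1000)  # five small writes...

print(len(flushed), len(w.buf))  # ...one full-block write; 904 bytes buffered
```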

On-Disk Structures

Besides the cost of disk access itself, the main limitation and design condition for building efficient on-disk structures is the fact that the smallest unit of disk operation is a block. To follow a pointer to the specific location within the block, we have to fetch an entire block. Since we already have to do that, we can change the layout of the data structure to take advantage of it.

We’ve mentioned pointers several times throughout this chapter already, but this word has slightly different semantics for on-disk structures. On disk, most of the time we manage the data layout manually (unless, for example, we’re using memory mapped files). This is still similar to regular pointer operations, but we have to compute the target pointer addresses and follow the pointers explicitly.

Most of the time, on-disk offsets are precomputed (in cases when the pointer is written on disk before the part it points to) or cached in memory until they are flushed on the disk. Creating long dependency chains in on-disk structures greatly increases code and structure complexity, so it is preferred to keep the number of pointers and their spans to a minimum.

In summary, on-disk structures are designed with their target storage specifics in mind and generally optimize for fewer disk accesses. We can do this by improving locality, optimizing the internal representation of the structure, and reducing the number of out-of-page pointers.

“二叉搜索树”中,我们得出结论扇出低高度是最佳磁盘数据结构所需的属性。我们还刚刚讨论了来自指针的额外空间开销,以及由于平衡而重新映射这些指针所带来的维护开销。B-Tree 结合了这些思想:增加节点扇出,并减少树高、节点指针的数量和平衡操作的频率。

In “Binary Search Trees”, we came to the conclusion that high fanout and low height are desired properties for an optimal on-disk data structure. We’ve also just discussed additional space overhead coming from pointers, and maintenance overhead from remapping these pointers as a result of balancing. B-Trees combine these ideas: increase node fanout, and reduce tree height, the number of node pointers, and the frequency of balancing operations.

Ubiquitous B-Trees

We are braver than a bee, and a… longer than a tree…

Winnie the Pooh

B-Trees can be thought of as a vast catalog room in the library: you first have to pick the correct cabinet, then the correct shelf in that cabinet, then the correct drawer on the shelf, and then browse through the cards in the drawer to find the one you’re searching for. Similarly, a B-Tree builds a hierarchy that helps to navigate and locate the searched items quickly.

As we discussed in “Binary Search Trees”, B-Trees build upon the foundation of balanced search trees and are different in that they have higher fanout (have more child nodes) and smaller height.

In most of the literature, binary tree nodes are drawn as circles. Since each node is responsible just for one key and splits the range into two parts, this level of detail is sufficient and intuitive. At the same time, B-Tree nodes are often drawn as rectangles, and pointer blocks are also shown explicitly to highlight the relationship between child nodes and separator keys. Figure 2-7 shows binary tree, 2-3-Tree, and B-Tree nodes side by side, which helps to understand the similarities and differences between them.

Figure 2-7. Binary tree, 2-3-Tree, and B-Tree nodes side by side

Nothing prevents us from depicting binary trees in the same way. Both structures have similar pointer-following semantics, and differences start showing in how the balance is maintained. Figure 2-8 shows that and hints at similarities between BSTs and B-Trees: in both cases, keys split the tree into subtrees, and are used for navigating the tree and finding searched keys. You can compare it to Figure 2-1.

Figure 2-8. Alternative representation of a binary tree

B-Trees are sorted: keys inside the B-Tree nodes are stored in order. Because of that, to locate a searched key, we can use an algorithm like binary search. This also implies that lookups in B-Trees have logarithmic complexity. For example, finding a searched key among 4 billion (4 × 10⁹) items takes about 32 comparisons (see “B-Tree Lookup Complexity” for more on this subject). If we had to make a disk seek for each one of these comparisons, it would significantly slow us down, but since B-Tree nodes store dozens or even hundreds of items, we only have to make one disk seek per level jump. We’ll discuss a lookup algorithm in more detail later in this chapter.
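The arithmetic behind these numbers can be checked directly (the fanout of 100 is an illustrative assumption):

```python
import math

M = 4_000_000_000  # total items in the tree
fanout = 100       # keys per node; an illustrative assumption

# Binary search over all keys: one comparison halves the search space.
comparisons = math.ceil(math.log2(M))

# One block fetch per level: the fanout is the logarithm base.
page_reads = math.ceil(math.log(M, fanout))

print(comparisons, page_reads)
```

About 32 comparisons, but only a handful of page reads: this is why high fanout matters for disk-resident trees.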

Using B-Trees, we can efficiently execute both point and range queries. Point queries, expressed by the equality (=) predicate in most query languages, locate a single item. On the other hand, range queries, expressed by comparison (<, >, ≤, and ≥) predicates, are used to query multiple data items in order.

B-Tree Hierarchy

B-Trees consist of multiple nodes. Each node holds up to N keys and N + 1 pointers to the child nodes. These nodes are logically grouped into three groups:

Root node

This has no parents and is the top of the tree.

Leaf nodes

These are the bottom layer nodes that have no child nodes.

Internal nodes

These are all other nodes, connecting root with leaves. There is usually more than one level of internal nodes.

This hierarchy is shown in Figure 2-9.

Figure 2-9. B-Tree node hierarchy

Since B-Trees are a page organization technique (i.e., they are used to organize and navigate fixed-size pages), we often use terms node and page interchangeably.

The relation between the node capacity and the number of keys it actually holds is called occupancy.

B-Trees are characterized by their fanout: the number of keys stored in each node. Higher fanout helps to amortize the cost of structural changes required to keep the tree balanced and to reduce the number of seeks by storing keys and pointers to child nodes in a single block or multiple consecutive blocks. Balancing operations (namely, splits and merges) are triggered when the nodes are full or nearly empty.

Separator Keys

Keys stored in B-Tree nodes are called index entries, separator keys, or divider cells. They split the tree into subtrees (also called branches or subranges), holding corresponding key ranges. Keys are stored in sorted order to allow binary search. A subtree is found by locating a key and following a corresponding pointer from the higher to the lower level.

The first pointer in the node points to the subtree holding items less than the first key, and the last pointer in the node points to the subtree holding items greater than or equal to the last key. Other pointers reference subtrees between the two keys: Ki-1 ≤ Ks < Ki, where K is a set of keys, and Ks is a key that belongs to the subtree. Figure 2-10 shows these invariants.
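These invariants map directly onto code: choosing the child pointer for a key is a binary search over the separator keys. A sketch using Python's `bisect` (the function name is illustrative):

```python
from bisect import bisect_right


def child_index(separators, key):
    """Index of the child pointer to follow for `key`: pointer 0 covers
    keys < K0, the last pointer covers keys >= K[-1], and pointer i
    covers the subtree where K[i-1] <= key < K[i]."""
    return bisect_right(separators, key)


seps = [10, 20, 30]
print([child_index(seps, k) for k in (5, 10, 25, 40)])
```

`bisect_right` places a key equal to a separator into the right-hand subtree, matching the "greater than or equal" side of the invariant.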

Figure 2-10. How separator keys split the tree into subtrees

Some B-Tree variants also have sibling node pointers, most often on the leaf level, to simplify range scans. These pointers help avoid going back to the parent to find the next sibling. Some implementations have pointers in both directions, forming a double-linked list on the leaf level, which makes the reverse iteration possible.

What sets B-Trees apart is that, rather than being built from top to bottom (as binary search trees), they’re constructed the other way around—from bottom to top. The number of leaf nodes grows, which increases the number of internal nodes and tree height.

Since B-Trees reserve extra space inside nodes for future insertions and updates, tree storage utilization can get as low as 50%, but is usually considerably higher. Higher occupancy does not influence B-Tree performance negatively.

B-Tree Lookup Complexity

B-Tree lookup complexity can be viewed from two standpoints: the number of block transfers and the number of comparisons done during the lookup.

In terms of number of transfers, the logarithm base is N (number of keys per node). There are N times more nodes on each new level, and following a child pointer reduces the search space by the factor of N. During lookup, at most logN M (where M is the total number of items in the B-Tree) pages are addressed to find a searched key. The number of child pointers that have to be followed on the root-to-leaf pass is also equal to the number of levels, in other words, the height h of the tree.

From the perspective of number of comparisons, the logarithm base is 2, since searching a key inside each node is done using binary search. Every comparison halves the search space, so complexity is log2 M.

Knowing the distinction between the number of seeks and the number of comparisons helps us gain the intuition about how searches are performed and understand what lookup complexity is, from both perspectives.

In textbooks and articles,2 B-Tree lookup complexity is generally referenced as log M. Logarithm base is generally not used in complexity analysis, since changing the base simply adds a constant factor, and multiplication by a constant factor does not change complexity. For example, given the nonzero constant factor c, O(|c| × n) == O(n) [KNUTH97].

B-Tree Lookup Algorithm

Now that we have covered the structure and internal organization of B-Trees, we can define algorithms for lookups, insertions, and removals. To find an item in a B-Tree, we have to perform a single traversal from root to leaf. The objective of this search is to find a searched key or its predecessor. Finding an exact match is used for point queries, updates, and deletions; finding its predecessor is useful for range scans and inserts.

The algorithm starts from the root and performs a binary search, comparing the searched key with the keys stored in the root node until it finds the first separator key that is greater than the searched value. This locates a searched subtree. As we’ve discussed previously, index keys split the tree into subtrees with boundaries between two neighboring keys. As soon as we find the subtree, we follow the pointer that corresponds to it and continue the same search process (locate the separator key, follow the pointer) until we reach a target leaf node, where we either find the searched key or conclude it is not present by locating its predecessor.
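A compact in-memory sketch of this root-to-leaf traversal (the `Node` layout and names are invented for illustration; a real implementation operates on fixed-size pages):

```python
from bisect import bisect_right


class Node:
    def __init__(self, keys, children=None, values=None):
        self.keys, self.children, self.values = keys, children, values

    @property
    def is_leaf(self):
        return self.children is None


def lookup(node, key):
    """Single root-to-leaf pass: binary-search the separators, follow the
    corresponding child pointer, repeat until a leaf, then search it."""
    while not node.is_leaf:
        node = node.children[bisect_right(node.keys, key)]
    i = bisect_right(node.keys, key) - 1  # exact match or predecessor
    if i >= 0 and node.keys[i] == key:
        return node.values[i]
    return None  # key not present; node.keys[i] is its predecessor


leaves = [Node([1, 4], values=["a", "b"]), Node([9, 12], values=["c", "d"])]
root = Node([9], children=leaves)
print(lookup(root, 12), lookup(root, 5))
```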

On each level, we get a more detailed view of the tree: we start on the most coarse-grained level (the root of the tree) and descend to the next level where keys represent more precise, detailed ranges, until we finally reach leaves, where the data records are located.

During the point query, the search is done after finding or failing to find the searched key. During the range scan, iteration starts from the closest found key-value pair and continues by following sibling pointers until the end of the range is reached or the range predicate is exhausted.
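A sketch of such a range scan over the leaf level (the `Leaf` structure is illustrative; locating the starting leaf is done by an ordinary lookup, omitted here):

```python
class Leaf:
    def __init__(self, pairs, next_leaf=None):
        self.pairs = pairs          # sorted (key, value) pairs
        self.next_leaf = next_leaf  # sibling pointer


def range_scan(leaf, start, end):
    """Collect pairs with start <= key < end, hopping between sibling
    leaves instead of going back up through the parents."""
    out = []
    while leaf is not None:
        for k, v in leaf.pairs:
            if k >= end:
                return out  # range predicate exhausted
            if k >= start:
                out.append((k, v))
        leaf = leaf.next_leaf  # follow the sibling pointer
    return out  # end of the leaf level reached


l2 = Leaf([(9, "c"), (12, "d")])
l1 = Leaf([(1, "a"), (4, "b")], next_leaf=l2)
print(range_scan(l1, 3, 12))
```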

Counting Keys

Across the literature, you can find different ways to describe key and child offset counts. [BAYER72] mentions the device-dependent natural number k that represents an optimal page size. Pages, in this case, can hold between k and 2k keys, but can be partially filled and hold at least k + 1 and at most 2k + 1 pointers to child nodes. The root page can hold between 1 and 2k keys. Later, a number l is introduced, and it is said that any nonleaf page can have l + 1 keys.

Other sources, for example [GRAEFE11], describe nodes that can hold up to N separator keys and N + 1 pointers, with otherwise similar semantics and invariants.

Both approaches bring us to the same result, and differences are only used to emphasize the contents of each source. In this book, we stick to N as the number of keys (or key-value pairs, in the case of the leaf nodes) for clarity.

B-Tree Node Splits

To insert the value into a B-Tree, we first have to locate the target leaf and find the insertion point. For that, we use the algorithm described in the previous section. After the leaf is located, the key and value are appended to it. Updates in B-Trees work by locating a target leaf node using a lookup algorithm and associating a new value with an existing key.

If the target node doesn’t have enough room available, we say that the node has overflowed [NICHOLS66] and has to be split in two to fit the new data. More precisely, the node is split if the following conditions hold:

  • For leaf nodes: if the node can hold up to N key-value pairs, and inserting one more key-value pair brings it over its maximum capacity N.

  • For nonleaf nodes: if the node can hold up to N + 1 pointers, and inserting one more pointer brings it over its maximum capacity N + 1.

Splits are done by allocating the new node, transferring half the elements from the splitting node to it, and adding its first key and pointer to the parent node. In this case, we say that the key is promoted. The index at which the split is performed is called the split point (also called the midpoint). All elements after the split point (including split point in the case of nonleaf node split) are transferred to the newly created sibling node, and the rest of the elements remain in the splitting node.

If the parent node is full and does not have space available for the promoted key and pointer to the newly created node, it has to be split as well. This operation might propagate recursively all the way to the root.

As soon as the tree reaches its capacity (i.e., split propagates all the way up to the root), we have to split the root node. When the root node is split, a new root, holding a split point key, is allocated. The old root (now holding only half the entries) is demoted to the next level along with its newly created sibling, increasing the tree height by one. The tree height changes when the root node is split and the new root is allocated, or when two nodes are merged to form a new root. On the leaf and internal node levels, the tree only grows horizontally.

Figure 2-11 shows a fully occupied leaf node during insertion of the new element 11. We draw the line in the middle of the full node, leave half the elements in the node, and move the rest of elements to the new one. A split point value is placed into the parent node to serve as a separator key.

Figure 2-11. Leaf node split during the insertion of element 11. The new element and the promoted key are shown in gray.

Figure 2-12 shows the split process of a fully occupied nonleaf (i.e., root or internal) node during insertion of the new element 11. To perform a split, we first create a new node and move elements starting from index N/2 + 1 to it. The split point key is promoted to the parent.

Figure 2-12. Nonleaf node split during the insertion of element 11. The new element and the promoted key are shown in gray.

Since nonleaf node splits are always a manifestation of splits propagating from the levels below, we have an additional pointer (to the newly created node on the next level). If the parent does not have enough space, it has to be split as well.

It doesn’t matter whether the leaf or nonleaf node is split (i.e., whether the node holds keys and values or just the keys). In the case of leaf split, keys are moved together with their associated values.

When the split is done, we have two nodes and have to pick the correct one to finish insertion. For that, we can use the separator key invariants. If the inserted key is less than the promoted one, we finish the operation by inserting to the split node. Otherwise, we insert to the newly created one.

To summarize, node splits are done in four steps:

  1. Allocate a new node.

  2. Copy half the elements from the splitting node to the new one.

  3. Place the new element into the corresponding node.

  4. At the parent of the split node, add a separator key and a pointer to the new node.
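For a leaf node, these steps can be sketched as follows (a toy list-based model with a midpoint split; not a page-level implementation):

```python
def split_leaf(keys, values, new_key, new_value):
    """Toy leaf split: insert the new pair, cut at the midpoint, and
    return (left, right, promoted separator key)."""
    i = 0
    while i < len(keys) and keys[i] < new_key:
        i += 1
    keys = keys[:i] + [new_key] + keys[i:]        # insert into full node
    values = values[:i] + [new_value] + values[i:]
    mid = len(keys) // 2                           # split point (midpoint)
    left = (keys[:mid], values[:mid])
    right = (keys[mid:], values[mid:])
    return left, right, right[0][0]  # first key of the right node is promoted


left, right, sep = split_leaf([1, 3, 5, 7], ["a", "b", "c", "d"], 4, "x")
print(left, right, sep)
```

The promoted separator (`4` here) is what the parent stores to route future lookups between the two halves.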

B-Tree Node Merges

Deletions are done by first locating the target leaf. When the leaf is located, the key and the value associated with it are removed.

If neighboring nodes have too few values (i.e., their occupancy falls under a threshold), the sibling nodes are merged. This situation is called underflow. [BAYER72] describes two underflow scenarios: if two adjacent nodes have a common parent and their contents fit into a single node, their contents should be merged (concatenated); if their contents do not fit into a single node, keys are redistributed between them to restore balance (see “Rebalancing”). More precisely, two nodes are merged if the following conditions hold:

  • For leaf nodes: if a node can hold up to N key-value pairs, and a combined number of key-value pairs in two neighboring nodes is less than or equal to N.

  • For nonleaf nodes: if a node can hold up to N + 1 pointers, and a combined number of pointers in two neighboring nodes is less than or equal to N + 1.

Figure 2-13 shows the merge during deletion of element 16. To do this, we move elements from one of the siblings to the other one. Generally, elements from the right sibling are moved to the left one, but it can be done the other way around as long as the key order is preserved.

Figure 2-13. Leaf node merge

Figure 2-14 shows two sibling nonleaf nodes that have to be merged during deletion of element 10. If we combine their elements, they fit into one node, so we can have one node instead of two. During the merge of nonleaf nodes, we have to pull the corresponding separator key from the parent (i.e., demote it). The number of pointers is reduced by one because the merge is a result of the propagation of the pointer deletion from the lower level, caused by the page removal. Just as with splits, merges can propagate all the way to the root level.

Figure 2-14. Nonleaf node merge

To summarize, node merges are done in three steps, assuming the element is already removed:

  1. Copy all elements from the right node to the left one.

  2. Remove the right node pointer from the parent (or demote it in the case of a nonleaf merge).

  3. Remove the right node.
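The same steps, sketched for two sibling leaves (a toy list-based model; `sep_index` stands for the position of the separator between the siblings in the parent, and all names are illustrative):

```python
def merge_leaves(left, right, parent_keys, parent_children, sep_index):
    """Toy merge of two sibling leaves into the left one."""
    left.extend(right)                 # 1. copy right's elements to the left
    del parent_keys[sep_index]         # 2. drop the separator between them...
    del parent_children[sep_index + 1] # ...and the pointer to the right node
    # 3. the right node itself is now unreferenced and can be freed


l, r = [1, 2], [3]
pk = [3, 7]                  # parent separators
pc = [l, r, "other-subtree"] # parent child pointers; pc[0] is l, pc[1] is r
merge_leaves(l, r, pk, pc, 0)
print(l, pk, pc)
```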

One of the techniques often implemented in B-Trees to reduce the number of splits and merges is rebalancing, which we discuss in “Rebalancing”.

Summary

In this chapter, we started with a motivation to create specialized structures for on-disk storage. Binary search trees might have similar complexity characteristics, but still fall short of being suitable for disk because of low fanout and a large number of relocations and pointer updates caused by balancing. B-Trees solve both problems by increasing the number of items stored in each node (high fanout) and less frequent balancing operations.

After that, we discussed internal B-Tree structure and outlines of algorithms for lookup, insert, and delete operations. Split and merge operations help to restructure the tree to keep it balanced while adding and removing elements. We keep the tree depth to a minimum and add items to the existing nodes while there’s still some free space in them.

We can use this knowledge to create in-memory B-Trees. To create a disk-based implementation, we need to go into details of how to lay out B-Tree nodes on disk and compose on-disk layout using data-encoding formats.

1 This property is imposed by AVL Trees and several other data structures. More generally, binary search trees keep the difference in heights between subtrees within a small constant factor.

2 For example, [KNUTH98].

Chapter 3. File Formats

With the basic semantics of B-Trees covered, we are now ready to explore how exactly B-Trees and other structures are implemented on disk. We access the disk in a way that is different from how we access main memory: from an application developer’s perspective, memory accesses are mostly transparent. Because of virtual memory [BHATTACHARJEE17], we do not have to manage offsets manually. Disks are accessed using system calls (see https://databass.dev/links/54). We usually have to specify the offset inside the target file, and then interpret on-disk representation into a form suitable for main memory.

This means that efficient on-disk structures have to be designed with this distinction in mind. To do that, we have to come up with a file format that’s easy to construct, modify, and interpret. In this chapter, we’ll discuss general principles and practices that help us to design all sorts of on-disk structures, not only B-Trees.

There are numerous possibilities for B-Tree implementations, and here we discuss several useful techniques. Details may vary between implementations, but the general principles remain the same. Understanding the basic mechanics of B-Trees, such as splits and merges, is necessary, but they are insufficient for the actual implementation. There are many things that have to play together for the final result to be useful.

The semantics of pointer management in on-disk structures are somewhat different from in-memory ones. It is useful to think of on-disk B-Trees as a page management mechanism: algorithms have to compose and navigate pages. Pages and pointers to them have to be calculated and placed accordingly.

Since most of the complexity in B-Trees comes from mutability, we discuss details of page layouts, splitting, relocations, and other concepts applicable to mutable data structures. Later, when talking about LSM Trees (see “LSM Trees”), we focus on sorting and maintenance, since that’s where most LSM complexity comes from.

Motivation

Creating a file format is in many ways similar to how we create data structures in languages with an unmanaged memory model. We allocate a block of data and slice it any way we like, using fixed-size primitives and structures. If we want to reference a larger chunk of memory or a structure with variable size, we use pointers.

Languages with an unmanaged memory model allow us to allocate more memory any time we need (within reasonable bounds) without us having to think or worry about whether or not there’s a contiguous memory segment available, whether or not it is fragmented, or what happens after we free it. On disk, we have to take care of garbage collection and fragmentation ourselves.

Data layout is much less important in memory than on disk. For a disk-resident data structure to be efficient, we need to lay out data on disk in ways that allow quick access to it, and consider the specifics of a persistent storage medium, come up with binary data formats, and find a means to serialize and deserialize data efficiently.

Anyone who has ever used a low-level language such as C without additional libraries knows the constraints. Structures have a predefined size and are allocated and freed explicitly. Manually implementing memory allocation and tracking is even more challenging, since it is only possible to operate with memory segments of predefined size, and it is necessary to track which segments are already released and which ones are still in use.

When storing data in main memory, most of the problems with memory layout do not exist, are easier to solve, or can be solved using third-party libraries. For example, handling variable-length fields and oversize data is much more straightforward, since we can use memory allocation and pointers, and do not need to lay them out in any special way. There still are cases when developers design specialized main memory data layouts to take advantage of CPU cache lines, prefetching, and other hardware-related specifics, but this is mainly done for optimization purposes [FOWLER11].

Even though the operating system and filesystem take over some of the responsibilities, implementing on-disk structures requires attention to more details and has more pitfalls.

Binary Encoding

To store data on disk efficiently, it needs to be encoded using a format that is compact and easy to serialize and deserialize. When talking about binary formats, you hear the word layout quite often. Since we do not have primitives such as malloc and free, but only read and write, we have to think of accesses differently and prepare data accordingly.

Here, we discuss the main principles used to create efficient page layouts. These principles apply to any binary format: you can use similar guidelines to create file and serialization formats or communication protocols.

Before we can organize records into pages, we need to understand how to represent keys and data records in binary form, how to combine multiple values into more complex structures, and how to implement variable-size types and arrays.

Primitive Types

Keys and values have a type, such as integer, date, or string, and can be represented (serialized to and deserialized from) in their raw binary forms.

Most numeric data types are represented as fixed-size values. When working with multibyte numeric values, it is important to use the same byte-order (endianness) for both encoding and decoding. Endianness determines the sequential order of bytes:

大尾数
Big-endian

顺序从最高有效字节 (MSB) 开始,后面是按重要性递减顺序排列的字节。换句话说,MSB 具有最低地址。

The order starts from the most-significant byte (MSB), followed by the bytes in decreasing significance order. In other words, MSB has the lowest address.

小尾数法
Little-endian

顺序从最低有效字节 (LSB) 开始,后面跟着按重要性递增的顺序排列的字节。

The order starts from the least-significant byte (LSB), followed by the bytes in increasing significance order.

图 3-1 说明了这一点。十六进制 32 位整数 0xAABBCCDD(其中 AA 是 MSB)使用大端和小端字节顺序显示。

Figure 3-1 illustrates this. The hexadecimal 32-bit integer 0xAABBCCDD, where AA is the MSB, is shown using both big- and little-endian byte order.

图 3-1。大端和小端字节顺序。最高有效字节以灰色显示。地址(用 a 表示)从左向右增长。

例如,要重建具有相应字节顺序的 64 位整数,RocksDB 具有特定于平台的定义,有助于识别目标平台的字节顺序。1 如果目标平台字节序与值的字节序不匹配(EncodeFixed64WithEndian 查找 kLittleEndian 值并将其与值的字节序进行比较),则使用 EndianTransform 反转字节:以相反的顺序逐字节读取值并将它们附加到结果中。

For example, to reconstruct a 64-bit integer with a corresponding byte order, RocksDB has platform-specific definitions that help to identify target platform byte order.1 If the target platform endianness does not match value endianness (EncodeFixed64WithEndian looks up kLittleEndian value and compares it with value endianness), it reverses the bytes using EndianTransform, which reads values byte-wise in reverse order and appends them to the result.
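To make this concrete, here is a small sketch (in Python, using the standard struct module) of encoding the 32-bit integer 0xAABBCCDD from Figure 3-1 in both byte orders. The byte reversal at the end is analogous in spirit to what EndianTransform does, though RocksDB's actual implementation is in C++:

```python
import struct

value = 0xAABBCCDD

big = struct.pack(">I", value)     # big-endian: MSB (0xAA) comes first
little = struct.pack("<I", value)  # little-endian: LSB (0xDD) comes first

assert big == b"\xaa\xbb\xcc\xdd"
assert little == b"\xdd\xcc\xbb\xaa"

# Reversing the byte order converts between the two representations.
assert bytes(reversed(big)) == little

# Decoding with the matching byte order recovers the original value.
assert struct.unpack("<I", little)[0] == value
```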

记录由数字、字符串、布尔值及其组合等基元组成。然而,当通过网络传输数据或将其存储在磁盘上时,我们只能使用字节序列。这意味着,为了发送或写入记录,我们必须序列化它(将其转换为可解释的字节序列);而在接收或读取后使用它之前,我们必须反序列化它(将字节序列转换回原始记录)。

Records consist of primitives like numbers, strings, booleans, and their combinations. However, when transferring data over the network or storing it on disk, we can only use byte sequences. This means that, in order to send or write the record, we have to serialize it (convert it to an interpretable sequence of bytes) and, before we can use it after receiving or reading, we have to deserialize it (translate the sequence of bytes back to the original record).

在二进制数据格式中,我们总是从用作更复杂结构构建块的基元开始。不同的数字类型的大小可能不同:byte 值是 8 位,short 是 2 字节(16 位),int 是 4 字节(32 位),long 是 8 字节(64 位)。

In binary data formats, we always start with primitives that serve as building blocks for more complex structures. Different numeric types may vary in size. byte value is 8 bits, short is 2 bytes (16 bits), int is 4 bytes (32 bits), and long is 8 bytes (64 bits).

浮点数字(例如 float 和 double)由它们的符号、分数和指数表示。IEEE 二进制浮点运算标准(IEEE 754)描述了广泛接受的浮点数表示形式。32 位 float 表示单精度值。例如,浮点数 0.15652 具有如图 3-2 所示的二进制表示形式。前 23 位表示分数,后面 8 位表示指数,1 位表示符号(数字是否为负)。

Floating-point numbers (such as float and double) are represented by their sign, fraction, and exponent. The IEEE Standard for Binary Floating-Point Arithmetic (IEEE 754) standard describes widely accepted floating-point number representation. A 32-bit float represents a single-precision value. For example, a floating-point number 0.15652 has a binary representation, as shown in Figure 3-2. The first 23 bits represent a fraction, the following 8 bits represent an exponent, and 1 bit represents a sign (whether or not the number is negative).
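As a rough illustration (a Python sketch, not how a storage engine would normally do it), the three IEEE 754 fields of a single-precision value can be inspected by reinterpreting the float's bits as an integer:

```python
import struct

def float_bits(x):
    """Return the (sign, exponent, fraction) fields of an IEEE 754 single-precision value."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31             # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, biased by 127
    fraction = bits & 0x7FFFFF      # 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float_bits(0.15652)
assert sign == 0        # the number is positive
assert exponent == 124  # 0.15652 = 1.25216 * 2^-3, and -3 + 127 == 124

# Reconstructing the value shows it is only an approximation of 0.15652.
reconstructed = (1 + fraction / 2**23) * 2 ** (exponent - 127)
assert abs(reconstructed - 0.15652) < 1e-6
```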

图 3-2。单精度浮点数的二进制表示

由于浮点值是使用分数计算的,因此该表示产生的数字只是一个近似值。讨论完整的转换算法超出了本书的范围,我们只讨论表示基础知识。

Since a floating-point value is calculated using fractions, the number this representation yields is just an approximation. Discussing a complete conversion algorithm is out of the scope of this book, and we only cover representation basics.

double 表示双精度浮点值 [SAVARD05]。大多数编程语言的标准库都具有对浮点值与其二进制表示进行编码和解码的方法。

The double represents a double-precision floating-point value [SAVARD05]. Most programming languages have means for encoding and decoding floating-point values to and from their binary representation in their standard libraries.

字符串和可变大小的数据

Strings and Variable-Size Data

所有原始数字类型都具有固定大小。将更复杂的值组合在一起很像 C 中的 struct2。您可以将原始值组合成结构,并使用固定大小的数组或指向其他内存区域的指针。

All primitive numeric types have a fixed size. Composing more complex values together is much like struct2 in C. You can combine primitive values into structures and use fixed-size arrays or pointers to other memory regions.

字符串和其他可变大小的数据类型(例如固定大小数据的数组)可以序列化为一个表示数组或字符串长度的数字,后跟 size 个字节的实际数据。对于字符串,这种表示形式通常称为 UCSD String 或 Pascal String,以 Pascal 编程语言的流行实现命名。我们可以用伪代码表示如下:

Strings and other variable-size data types (such as arrays of fixed-size data) can be serialized as a number, representing the length of the array or string, followed by size bytes: the actual data. For strings, this representation is often called UCSD String or Pascal String, named after the popular implementation of the Pascal programming language. We can express it in pseudocode as follows:

String
{
    size    uint_16
    data    byte[size]
}

Pascal 字符串的一个替代方案是空终止字符串,其中读取器逐字节读取字符串,直到到达字符串结尾符号。Pascal 字符串方法有几个优点:它允许在恒定时间内得到字符串的长度,而不必迭代字符串内容;并且可以通过从内存中切出 size 个字节并将字节数组传递给字符串构造函数来组成特定于语言的字符串。

An alternative to Pascal strings is null-terminated strings, where the reader consumes the string byte-wise until the end-of-string symbol is reached. The Pascal string approach has several advantages: it allows finding out a length of a string in constant time, instead of iterating through string contents, and a language-specific string can be composed by slicing size bytes from memory and passing the byte array to a string constructor.
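A minimal sketch of this length-prefixed representation, assuming a little-endian uint16 size prefix as in the pseudocode above:

```python
import struct

def encode_string(s):
    """Pascal-style string: a uint16 length prefix followed by the bytes."""
    data = s.encode("utf-8")
    return struct.pack("<H", len(data)) + data

def decode_string(buf, offset=0):
    """Return (string, offset of the next record) — length is known in constant time."""
    (size,) = struct.unpack_from("<H", buf, offset)
    start = offset + 2
    return buf[start:start + size].decode("utf-8"), start + size

# Strings can be concatenated back to back and read sequentially.
buf = encode_string("Leslie") + encode_string("Tom")
first, next_off = decode_string(buf)
second, _ = decode_string(buf, next_off)
assert (first, second) == ("Leslie", "Tom")
```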

位打包数据:布尔值、枚举和标志

Bit-Packed Data: Booleans, Enums, and Flags

布尔值可以使用单个字节表示,或者将 true 和 false 编码为 1 和 0。由于布尔值只有两个值,因此使用整个字节来表示是浪费的,开发人员通常将布尔值以八个为一组进行打包,每个布尔值仅占用一位。我们说每个 1 位是已设置的,每个 0 位是未设置的或空的。

Booleans can be represented either by using a single byte, or encoding true and false as 1 and 0 values. Since a boolean has only two values, using an entire byte for its representation is wasteful, and developers often batch boolean values together in groups of eight, each boolean occupying just one bit. We say that every 1 bit is set and every 0 bit is unset or empty.
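A small sketch of packing a group of eight booleans into a single byte (the bit order is a design choice; here bit i stores the i-th value):

```python
def pack_bools(bools):
    """Pack up to 8 booleans into one byte; bit i holds bools[i]."""
    byte = 0
    for i, b in enumerate(bools):
        if b:
            byte |= 1 << i  # set bit i
    return byte

def unpack_bools(byte, count=8):
    """Recover the list of booleans from a packed byte."""
    return [(byte >> i) & 1 == 1 for i in range(count)]

flags = [True, False, False, True, False, False, False, False]
packed = pack_bools(flags)
assert packed == 0b00001001   # bits 0 and 3 are set
assert unpack_bools(packed) == flags
```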

枚举(enum,枚举类型的简称)可以表示为整数,通常用于二进制格式和通信协议中。枚举用于表示经常重复的低基数值。例如,我们可以使用枚举对 B 树节点类型进行编码:

Enums, short for enumerated types, can be represented as integers and are often used in binary formats and communication protocols. Enums are used to represent often-repeated low-cardinality values. For example, we can encode a B-Tree node type using an enum:

enum NodeType {
   ROOT,     // 0x00h
   INTERNAL, // 0x01h
   LEAF      // 0x02h
};

另一个密切相关的概念是标志,它是打包布尔值和枚举的一种组合。标志可以表示非互斥的命名布尔参数。例如,我们可以使用标志来表示页面是否包含值单元格、值是固定大小还是可变大小,以及是否存在与该节点关联的溢出页面。由于每个位都代表一个标志值,因此我们只能使用 2 的幂作为掩码(因为二进制中 2 的幂始终只有一个设置位;例如 2^3 == 8 == 1000b、2^4 == 16 == 0001 0000b 等):

Another closely related concept is flags, kind of a combination of packed booleans and enums. Flags can represent nonmutually exclusive named boolean parameters. For example, we can use flags to denote whether or not the page holds value cells, whether the values are fixed-size or variable-size, and whether or not there are overflow pages associated with this node. Since every bit represents a flag value, we can only use power-of-two values for masks (since powers of two in binary always have a single set bit; for example, 2^3 == 8 == 1000b, 2^4 == 16 == 0001 0000b, etc.):

int IS_LEAF_MASK         = 0x01h; // bit #1
int VARIABLE_SIZE_VALUES = 0x02h; // bit #2
int HAS_OVERFLOW_PAGES   = 0x04h; // bit #3

与打包布尔值一样,可以使用位掩码和按位运算符从打包值中读取和写入标志值。例如,为了设置负责其中一个标志的位,我们可以使用按位 OR(|)和位掩码。我们也可以使用位移(<<)和位索引来代替位掩码。要取消设置该位,我们可以使用按位 AND(&)和按位求反运算符(~)。要测试第 n 位是否已设置,我们可以将按位 AND 的结果与 0 进行比较:

Just like packed booleans, flag values can be read and written from the packed value using bitmasks and bitwise operators. For example, in order to set a bit responsible for one of the flags, we can use bitwise OR (|) and a bitmask. Instead of a bitmask, we can use bitshift (<<) and a bit index. To unset the bit, we can use bitwise AND (&) and the bitwise negation operator (~). To test whether or not the bit n is set, we can compare the result of a bitwise AND with 0:

// Set the bit
flags |= HAS_OVERFLOW_PAGES;
flags |= (1 << 2);

// Unset the bit
flags &= ~HAS_OVERFLOW_PAGES;
flags &= ~(1 << 2);

// Test whether or not the bit is set
is_set = (flags & HAS_OVERFLOW_PAGES) != 0;
is_set = (flags & (1 << 2)) != 0;

一般原则

General Principles

通常,你通过决定如何完成寻址来开始设计文件格式:文件是否将被分割成相同大小的页面,这些页面由单个块或多个连续块表示。大多数就地更新存储结构都使用相同大小的页面,因为它显着简化了读写访问。仅追加存储结构通常也按页写入数据:记录一个接一个地追加,一旦内存中的页面填满,就会将其刷新到磁盘上。

Usually, you start designing a file format by deciding how the addressing is going to be done: whether the file is going to be split into same-sized pages, which are represented by a single block or multiple contiguous blocks. Most in-place update storage structures use pages of the same size, since it significantly simplifies read and write access. Append-only storage structures often write data page-wise, too: records are appended one after the other and, as soon as the page fills up in memory, it is flushed on disk.

文件通常以固定大小的标头开始,并可能以固定大小的尾部结束,其中包含应快速访问或解码文件其余部分所需的辅助信息。文件的其余部分被分成几页。图 3-3示意性地显示了该文件的组织结构。

The file usually starts with a fixed-size header and may end with a fixed-size trailer, which hold auxiliary information that should be accessed quickly or is required for decoding the rest of the file. The rest of the file is split into pages. Figure 3-3 shows this file organization schematically.

图 3-3。文件组织

许多数据存储都有固定的模式,指定表可以容纳的字段的数量、顺序和类型。拥有固定的模式有助于减少存储在磁盘上的数据量:我们可以使用它们的位置标识符,而不是重复写入字段名称

Many data stores have a fixed schema, specifying the number, order, and type of fields the table can hold. Having a fixed schema helps to reduce the amount of data stored on disk: instead of repeatedly writing field names, we can use their positional identifiers.

如果我们想设计一种公司目录格式,存储每个员工的姓名、出生日期、税号和性别,我们可以使用多种方法。我们可以将固定大小的字段(例如出生日期和税号)存储在结构体的头部,然后存储可变大小的字段:

If we wanted to design a format for the company directory, storing names, birth dates, tax numbers, and genders for each employee, we could use several approaches. We could store the fixed-size fields (such as birth date and tax number) in the head of the structure, followed by the variable-size ones:

Fixed-size fields:
| (4 bytes) employee_id                |
| (4 bytes) tax_number                 |
| (3 bytes) date                       |
| (1 byte)  gender                     |
| (2 bytes) first_name_length          |
| (2 bytes) last_name_length           |

Variable-size fields:
| (first_name_length bytes) first_name |
| (last_name_length bytes) last_name   |

现在,要访问 first_name,我们可以在固定大小区域之后切出 first_name_length 个字节。要访问 last_name,我们可以通过检查其前面的可变大小字段的大小来定位其起始位置。为了避免涉及多个字段的计算,我们可以将偏移量和长度都编码到固定大小的区域。在这种情况下,我们可以单独定位任何可变大小的字段。

Now, to access first_name, we can slice first_name_length bytes after the fixed-size area. To access last_name, we can locate its starting position by checking the sizes of the variable-size fields that precede it. To avoid calculations involving multiple fields, we can encode both offset and length to the fixed-size area. In this case, we can locate any variable-size field separately.
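A simplified sketch of this layout (it keeps only a subset of the fields — employee_id, tax_number, and the two name lengths — and assumes little-endian encoding):

```python
import struct

# Fixed-size head: uint32 employee_id, uint32 tax_number,
# uint16 first_name_length, uint16 last_name_length (12 bytes total),
# followed by the variable-size name bytes.
FIXED_AREA = struct.calcsize("<IIHH")  # 12

def encode_record(employee_id, tax_number, first_name, last_name):
    fn = first_name.encode("utf-8")
    ln = last_name.encode("utf-8")
    head = struct.pack("<IIHH", employee_id, tax_number, len(fn), len(ln))
    return head + fn + ln

def decode_last_name(buf):
    # last_name starts after the fixed area plus first_name_length bytes.
    _, _, fn_len, ln_len = struct.unpack_from("<IIHH", buf, 0)
    start = FIXED_AREA + fn_len
    return buf[start:start + ln_len].decode("utf-8")

rec = encode_record(1, 42, "Alex", "Petrov")
assert decode_last_name(rec) == "Petrov"
```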

构建更复杂的结构通常涉及构建层次结构:由基元组成的字段、由字段组成的单元格、由单元格组成的页面、由页面组成的部分、由部分组成的区域等等。这里没有必须遵循的严格规则,这完全取决于您需要为其创建格式的数据类型。

Building more complex structures usually involves building hierarchies: fields composed out of primitives, cells composed of fields, pages composed of cells, sections composed of pages, regions composed of sections, and so on. There are no strict rules you have to follow here, and it all depends on what kind of data you need to create a format for.

数据库文件通常由多个部分组成,其中有一个查找表帮助导航并指向这些部分的起始偏移量,这些部分写入文件头、尾部或单独的文件中。

Database files often consist of multiple parts, with a lookup table aiding navigation and pointing to the start offsets of these parts written either in the file header, trailer, or in the separate file.

页面结构

Page Structure

数据库系统将数据记录存储在数据文件和索引文件中。这些文件被分区为固定大小的单元,称为页面,其大小通常是多个文件系统块的大小。页面大小通常为 4 到 16 Kb。

Database systems store data records in data and index files. These files are partitioned into fixed-size units called pages, which often have a size of multiple filesystem blocks. Page sizes usually range from 4 to 16 Kb.

让我们来看看磁盘上 B 树节点的例子。从结构的角度来看,在 B 树中,我们区分保存键和数据记录对的叶节点,以及保存键和指向其他节点的指针的非叶节点。每个 B 树节点占用一页或链接在一起的多个页面,因此在 B 树上下文中,术语“节点”和“页”(甚至“块”)通常可以互换使用。

Let’s take a look at the example of an on-disk B-Tree node. From a structure perspective, in B-Trees, we distinguish between the leaf nodes that hold keys and data records pairs, and nonleaf nodes that hold keys and pointers to other nodes. Each B-Tree node occupies one page or multiple pages linked together, so in the context of B-Trees the terms node and page (and even block) are often used interchangeably.

最初的 B 树论文 [BAYER72] 描述了一种针对固定大小数据记录的简单页面组织,其中每个页面只是三元组的串联,如图 3-4 所示:键用 k 表示,关联值用 v 表示,指向子页面的指针用 p 表示。

The original B-Tree paper [BAYER72] describes a simple page organization for fixed-size data records, where each page is just a concatenation of triplets, as shown in Figure 3-4: keys are denoted by k, associated values are denoted by v, and pointers to child pages are denoted by p.

图 3-4。固定大小记录的页面组织

这种方法很容易遵循,但有一些缺点:

This approach is easy to follow, but has some downsides:

  • 在除右侧之外的任何位置添加键都需要重新定位元素。

  • Appending a key anywhere but the right side requires relocating elements.

  • 它不允许有效地管理或访问可变大小的记录,并且仅适用于固定大小的数据。

  • It doesn’t allow managing or accessing variable-size records efficiently and works only for fixed-size data.

开槽页

Slotted Pages

存储可变大小的记录时,主要问题是空闲空间管理:回收被删除的记录占用的空间。如果我们尝试将大小为 n 的记录放入先前由大小为 m 的记录占用的空间中,除非 m == n 或者我们可以找到另一个大小恰好为 m – n 的记录,否则该空间将保持未使用状态。类似地,如果 k 大于 m,则大小为 m 的段不能用于存储大小为 k 的记录,因此该记录将被插入到别处而不回收未使用的空间。

When storing variable-size records, the main problem is free space management: reclaiming the space occupied by removed records. If we attempt to put a record of size n into the space previously occupied by the record of size m, unless m == n or we can find another record that has a size exactly m – n, this space will remain unused. Similarly, a segment of size m cannot be used to store a record of size k if k is larger than m, so it will be inserted without reclaiming the unused space.

为了简化可变大小记录的空间管理,我们可以将页面分割成固定大小的段。然而,如果我们这样做,最终也会浪费空间。例如,如果我们使用 64 字节的段大小,除非记录大小是 64 的倍数,否则我们会浪费64 - (n modulo 64)字节,其中n是插入记录的大小。换句话说,除非记录是 64 的倍数,否则其中一个块将仅被部分填充。

To simplify space management for variable-size records, we can split the page into fixed-size segments. However, we end up wasting space if we do that, too. For example, if we use a segment size of 64 bytes, unless the record size is a multiple of 64, we waste 64 - (n modulo 64) bytes, where n is the size of the inserted record. In other words, unless the record is a multiple of 64, one of the blocks will be only partially filled.
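The waste calculation above can be sketched in a couple of lines:

```python
def wasted_bytes(n, segment=64):
    """Bytes wasted when a record of size n is stored in fixed-size segments."""
    return 0 if n % segment == 0 else segment - (n % segment)

assert wasted_bytes(100) == 28  # 100 = 64 + 36; the second segment wastes 64 - 36 bytes
assert wasted_bytes(128) == 0   # exact multiple of the segment size: no waste
```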

空间回收可以通过简单地重写页面并移动记录来完成,但我们需要保留记录偏移量,因为页外指针可能正在使用这些偏移量。最好在做到这一点的同时尽量减少空间浪费。

Space reclamation can be done by simply rewriting the page and moving the records around, but we need to preserve record offsets, since out-of-page pointers might be using these offsets. It is desirable to do that while minimizing space waste, too.

总而言之,我们需要一种页面格式,使我们能够:

To summarize, we need a page format that allows us to:

  • 以最小的开销存储可变大小的记录。

  • Store variable-size records with a minimal overhead.

  • 回收被删除的记录占用的空间。

  • Reclaim space occupied by the removed records.

  • 引用页面中的记录,而不考虑其确切位置。

  • Reference records in the page without regard to their exact locations.

为了有效地存储可变大小的记录,例如字符串、二进制大对象(BLOB)等,我们可以使用称为槽页(即带有槽的页)[SILBERSCHATZ10] 或槽目录 [RAMAKRISHNAN03] 的组织技术。许多数据库都使用这种方法,例如 PostgreSQL。

To efficiently store variable-size records such as strings, binary large objects (BLOBs), etc., we can use an organization technique called slotted page (i.e., a page with slots) [SILBERSCHATZ10] or slot directory [RAMAKRISHNAN03]. This approach is used by many databases, for example, PostgreSQL.

我们将页面组织成槽单元的集合,并将指针和单元分成位于页面不同侧的两个独立内存区域。这意味着我们只需要重新组织寻址单元格的指针即可保持顺序,并且可以通过使指针无效或删除它来删除记录。

We organize the page into a collection of slots or cells and split out pointers and cells in two independent memory regions residing on different sides of the page. This means that we only need to reorganize pointers addressing the cells to preserve the order, and deleting a record can be done either by nullifying its pointer or removing it.

开槽页面有一个固定大小的标题,其中包含有关页面和单元格的重要信息(请参阅“页面标题”)。单元格的大小可能不同,并且可以保存任意数据:键、指针、数据记录等。图 3-5显示了分槽页组织,其中每个页都有一个维护区域(标头)、单元格和指向它们的指针。

A slotted page has a fixed-size header that holds important information about the page and cells (see “Page Header”). Cells may differ in size and can hold arbitrary data: keys, pointers, data records, etc. Figure 3-5 shows a slotted page organization, where every page has a maintenance region (header), cells, and pointers to them.

图 3-5。开槽页

让我们看看这种方法如何解决我们在本节开头提到的问题

Let’s see how this approach fixes the problems we stated in the beginning of this section:

  • 最小开销:分槽页产生的唯一开销是一个指针数组,该数组保存了存储记录的确切位置的偏移量。

  • Minimal overhead: the only overhead incurred by slotted pages is a pointer array holding offsets to the exact positions where the records are stored.

  • 空间回收:可以通过碎片整理和重写页面来回收空间。

  • Space reclamation: space can be reclaimed by defragmenting and rewriting the page.

  • 动态布局:从页面外部,插槽仅通过其 ID 引用,因此确切位置位于页面内部。

  • Dynamic layout: from outside the page, slots are referenced only by their IDs, so the exact location is internal to the page.

单元布局

Cell Layout

使用标志、枚举和原始值,我们可以开始设计单元格布局,然后将单元格组合成页面,并从页面组成一棵树。在单元格级别上,我们区分键单元格和键值单元格。键单元格包含一个分隔符键和一个指向两个相邻指针之间的页面的指针。键值单元格保存与其关联的键和数据记录。

Using flags, enums, and primitive values, we can start designing the cell layout, then combine cells into pages, and compose a tree out of the pages. On a cell level, we have a distinction between key and key-value cells. Key cells hold a separator key and a pointer to the page between two neighboring pointers. Key-value cells hold keys and data records associated with them.

我们假设页面中的所有单元格都是统一的(例如,所有单元格可以仅保存键或同时保存键和值;类似地,所有单元格保存固定大小或可变大小的数据,但不能混合两者)。这意味着我们可以在页面级别存储描述单元格的元数据一次,而不是在每个单元格中复制它。

We assume that all cells within the page are uniform (for example, all cells can hold either just keys or both keys and values; similarly, all cells hold either fixed-size or variable-size data, but not a mix of both). This means we can store metadata describing cells once on the page level, instead of duplicating it in every cell.

要组成关键单元格,我们需要知道:

To compose a key cell, we need to know:

  • 单元格类型(可以从页面元数据推断)

  • Cell type (can be inferred from the page metadata)

  • 键大小

  • Key size

  • 该单元格指向的子页面的 ID

  • ID of the child page this cell is pointing to

  • 键字节

  • Key bytes

可变大小的键单元格布局可能看起来像这样(固定大小的键单元格布局在单元格级别上没有大小说明符):

A variable-size key cell layout might look something like this (a fixed-size one would have no size specifier on the cell level):

0                4               8
+----------------+---------------+-------------+
| [int] key_size | [int] page_id | [bytes] key |
+----------------+---------------+-------------+

我们将固定大小的数据字段分组在一起,后跟 key_size 个字节。这不是绝对必要的,但可以简化偏移量计算,因为所有固定大小的字段都可以使用静态的、预先计算的偏移量来访问,我们只需要为可变大小的数据计算偏移量。

We have grouped fixed-size data fields together, followed by key_size bytes. This is not strictly necessary but can simplify offset calculation, since all fixed-size fields can be accessed by using static, precomputed offsets, and we need to calculate the offsets only for the variable-size data.
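A sketch of encoding and decoding such a variable-size key cell, assuming 4-byte little-endian integers for both fixed fields:

```python
import struct

# Layout from the diagram above: [uint32 key_size][uint32 page_id][key bytes]
def encode_key_cell(page_id, key):
    return struct.pack("<II", len(key), page_id) + key

def decode_key_cell(buf):
    # Fixed-size fields are read at static, precomputed offsets (0 and 4);
    # only the variable-size key needs a computed extent.
    key_size, page_id = struct.unpack_from("<II", buf, 0)
    key = buf[8:8 + key_size]
    return page_id, key

cell = encode_key_cell(7, b"Leslie")
assert decode_key_cell(cell) == (7, b"Leslie")
```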

键值单元格保存数据记录而不是子页面 ID。否则,它们的结构是相似的:

The key-value cells hold data records instead of the child page IDs. Otherwise, their structure is similar:

  • 单元格类型(可以从页面元数据推断)

  • Cell type (can be inferred from page metadata)

  • 键大小

  • Key size

  • 值大小

  • Value size

  • 键字节

  • Key bytes

  • 数据记录字节

  • Data record bytes

0              1                5 ...
+--------------+----------------+
| [byte] flags | [int] key_size |
+--------------+----------------+

5                  9                    .. + key_size
+------------------+--------------------+----------------------+
| [int] value_size |     [bytes] key    | [bytes] data_record  |
+------------------+--------------------+----------------------+

您可能已经注意到此处偏移量和页面 ID 之间的区别。由于页面具有固定大小并由页面缓存管理(请参阅“缓冲区管理”),因此我们只需要存储页面 ID,稍后使用查找表将其转换为文件中的实际偏移量。单元格偏移量是页面本地的,相对于页面起始偏移量:这样我们可以使用更小基数的整数来保持表示更紧凑。

You might have noticed the distinction between the offset and page ID here. Since pages have a fixed size and are managed by the page cache (see “Buffer Management”), we only need to store the page ID, which is later translated to the actual offset in the file using the lookup table. Cell offsets are page-local and are relative to the page start offset: this way we can use a smaller cardinality integer to keep the representation more compact.

将单元组合成分槽页

Combining Cells into Slotted Pages

为了将单元格组织成页面,我们可以使用“页面结构”中讨论的分槽页面技术。我们将单元格附加到页面的右侧(靠近页面末尾),并将单元格偏移量/指针保留在页面的左侧,如图 3-6 所示。

To organize cells into pages, we can use the slotted page technique that we discussed in “Page Structure”. We append cells to the right side of the page (toward its end) and keep cell offsets/pointers in the left side of the page, as shown in Figure 3-6.

图 3-6。偏移量和单元格增长方向

键可以乱序插入,并且通过按键顺序对单元格偏移指针进行排序来保持其逻辑排序顺序。这种设计允许以最小的努力将单元格附加到页面,因为在插入、更新或删除操作期间不必重新定位单元格。

Keys can be inserted out of order and their logical sorted order is kept by sorting cell offset pointers in key order. This design allows appending cells to the page with minimal effort, since cells don’t have to be relocated during insert, update, or delete operations.

让我们考虑一个包含名称的页面示例。页面中添加了两个名字,插入顺序是:Tom 和 Leslie。如图 3-7 所示,它们的逻辑顺序(在本例中为字母顺序)与插入顺序(它们附加到页面的顺序)不匹配。单元格按插入顺序排列,但偏移量重新排序以允许使用二分搜索。

Let’s consider an example of a page that holds names. Two names are added to the page, and their insertion order is: Tom and Leslie. As you can see in Figure 3-7, their logical order (in this case, alphabetical), does not match insertion order (order in which they were appended to the page). Cells are laid out in insertion order, but offsets are re-sorted to allow using binary search.

图 3-7。按随机顺序附加的记录:Tom、Leslie

现在,我们想在此页面上再添加一个名字:Ron。新数据附加在页面可用空间的上边界,但单元格偏移必须保留字典键顺序:LeslieRonTom。为此,我们必须重新排序单元格偏移:插入点之后的指针向右移动,为指向 Ron 单元格的新指针腾出空间,如图3-8所示。

Now, we’d like to add one more name to this page: Ron. New data is appended at the upper boundary of the free space of the page, but cell offsets have to preserve the lexicographical key order: Leslie, Ron, Tom. To do that, we have to reorder cell offsets: pointers after the insertion point are shifted to the right to make space for the new pointer to the Ron cell, as you can see in Figure 3-8.

图 3-8。再追加一项记录:Ron
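The behavior in Figures 3-7 and 3-8 can be sketched with a toy in-memory model: cells stay in insertion order, and only the offset array is kept sorted (using binary search to find the insertion point):

```python
import bisect

class SlottedPage:
    """Toy slotted page: cells appended in arrival order, offsets kept in key order."""
    def __init__(self):
        self.cells = []    # cells in insertion (physical) order
        self.offsets = []  # indices into self.cells, sorted by key

    def insert(self, key):
        self.cells.append(key)  # append the cell; never relocate existing cells
        keys = [self.cells[i] for i in self.offsets]
        pos = bisect.bisect_left(keys, key)  # binary search over the logical order
        self.offsets.insert(pos, len(self.cells) - 1)

    def sorted_keys(self):
        return [self.cells[i] for i in self.offsets]

page = SlottedPage()
for name in ["Tom", "Leslie", "Ron"]:
    page.insert(name)

assert page.cells == ["Tom", "Leslie", "Ron"]          # physical (insertion) order
assert page.sorted_keys() == ["Leslie", "Ron", "Tom"]  # logical order via offsets
```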

管理可变大小的数据

Managing Variable-Size Data

从页面中删除一项不必删除实际的单元格并移动其他单元格以重新占用释放的空间。相反,可以将该单元格标记为已删除,并用已释放的内存量和指向已释放值的指针更新内存中的可用性列表。可用性列表存储已释放段的偏移量及其大小。插入新单元格时,我们首先检查可用性列表,查找是否有适合它的段。您可以在图 3-9 中看到带有可用段的碎片页面示例。

Removing an item from the page does not have to remove the actual cell and shift other cells to reoccupy the freed space. Instead, the cell can be marked as deleted and an in-memory availability list can be updated with the amount of freed memory and a pointer to the freed value. The availability list stores offsets of freed segments and their sizes. When inserting a new cell, we first check the availability list to find if there’s a segment where it may fit. You can see an example of the fragmented page with available segments in Figure 3-9.

图 3-9。碎片化的页面和可用性列表。占用的页面显示为灰色。虚线表示指向可用性列表中未占用内存区域的指针。

SQLite 将未占用的段称为空闲块(freeblock),并在页眉中存储指向第一个空闲块的指针。此外,它还存储页面内的可用字节总数,以便快速检查对页面进行碎片整理后是否可以将新元素放入页面中。

SQLite calls unoccupied segments freeblocks and stores a pointer to the first freeblock in the page header. Additionally, it stores a total number of available bytes within the page to quickly check whether or not we can fit a new element into the page after defragmenting it.

适配方式根据策略计算:

Fit is calculated based on the strategy:

首次适配
First fit

可能会导致更大的开销,因为重用第一个合适的段后剩余的空间可能太小而无法容纳任何其他单元,因此它将被有效地浪费。

This might cause a larger overhead, since the space remaining after reusing the first suitable segment might be too small to fit any other cell, so it will be effectively wasted.

最佳适配
Best fit

为了获得最佳拟合,我们尝试找到插入留下最小余数的段。

For best fit, we try to find a segment for which insertion leaves the smallest remainder.
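Both strategies can be sketched over an availability list of (offset, length) pairs:

```python
def first_fit(free_list, size):
    """free_list: list of (offset, length). Return index of the first segment that fits."""
    for i, (_, length) in enumerate(free_list):
        if length >= size:
            return i
    return None

def best_fit(free_list, size):
    """Return index of the fitting segment that leaves the smallest remainder."""
    best, best_rem = None, None
    for i, (_, length) in enumerate(free_list):
        rem = length - size
        if rem >= 0 and (best_rem is None or rem < best_rem):
            best, best_rem = i, rem
    return best

free = [(0, 40), (100, 16), (200, 24)]
assert first_fit(free, 14) == 0  # first suitable segment, leaving 26 hard-to-reuse bytes
assert best_fit(free, 14) == 1   # the 16-byte segment leaves only a 2-byte remainder
```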

如果我们找不到足够的连续字节来容纳新单元,但有足够的碎片字节可用,则会读取并重写活动单元,对页面进行碎片整理并回收空间以进行新的写入。如果碎片整理后仍然没有足够的可用空间,我们必须创建一个溢出页面(请参阅“溢出页面”)。

If we cannot find enough consecutive bytes to fit the new cell but there are enough fragmented bytes available, live cells are read and rewritten, defragmenting the page and reclaiming space for new writes. If there’s not enough free space even after defragmentation, we have to create an overflow page (see “Overflow Pages”).

笔记

为了提高局部性(特别是当键尺寸较小时),某些实现将键和值分别存储在叶级别上。将键放在一起可以改善搜索过程中的局部性。找到搜索到的键后,可以在具有相应索引的值单元格中找到其值。对于可变大小的键,这需要我们计算并存储附加值单元格指针。

To improve locality (especially when keys are small in size), some implementations store keys and values separately on the leaf level. Keeping keys together can improve the locality during the search. After the searched key is located, its value can be found in a value cell with a corresponding index. With variable-size keys, this requires us to calculate and store an additional value cell pointer.

综上所述,为了简化 B-Tree 布局,我们假设每个节点占用一个页面。页由固定大小的页眉、单元指针块和单元组成。单元格保存指向代表子节点或关联数据记录的页面的键和指针。B 树使用简单的指针层次结构:页面标识符用于定位树文件中的子节点,单元偏移量用于定位页面内的单元。

In summary, to simplify B-Tree layout, we assume that each node occupies a single page. A page consists of a fixed-size header, cell pointer block, and cells. Cells hold keys and pointers to the pages representing child nodes or associated data records. B-Trees use simple pointer hierarchies: page identifiers to locate the child nodes in the tree file, and cell offsets to locate cells within the page.

版本控制

Versioning

数据库系统不断发展,开发人员致力于添加功能,并修复错误和性能问题。因此,二进制文件格式可能会发生变化。大多数时候,任何存储引擎版本都必须支持不止一种序列化格式(例如,当前格式和一种或多种传统格式以实现向后兼容性)。为了支持这一点,我们必须能够找出我们所面对的文件版本。

Database systems constantly evolve, and developers work to add features, and to fix bugs and performance issues. As a result of that, the binary file format can change. Most of the time, any storage engine version has to support more than one serialization format (e.g., current and one or more legacy formats for backward compatibility). To support that, we have to be able to find out which version of the file we’re up against.

这可以通过多种方式来完成。例如,Apache Cassandra 在文件名中使用版本前缀。这样,您甚至无需打开文件就可以知道该文件的版本。从版本 4.0 开始,数据文件名具有前缀na,例如na-1-big-Data.db。较旧的文件具有不同的前缀:3.0 版本编写的文件具有该ma前缀。

This can be done in several ways. For example, Apache Cassandra is using version prefixes in filenames. This way, you can tell which version the file has without even opening it. As of version 4.0, a data file name has the na prefix, such as na-1-big-Data.db. Older files have different prefixes: files written in version 3.0 have the ma prefix.

或者,版本可以存储在单独的文件中。例如,PostgreSQL 将版本存储在 PG_VERSION 文件中。

Alternatively, the version can be stored in a separate file. For example, PostgreSQL stores the version in the PG_VERSION file.

版本也可以直接存储在索引文件头中。在这种情况下,标头的一部分(或整个标头)必须以在版本之间不改变的格式进行编码。在找出文件的编码版本后,我们可以创建一个特定于版本的阅读器来解释内容。某些文件格式使用幻数来标识版本,我们在“幻数”中更详细地讨论。

The version can also be stored directly in the index file header. In this case, a part of the header (or an entire header) has to be encoded in a format that does not change between versions. After finding out which version the file is encoded with, we can create a version-specific reader to interpret the contents. Some file formats identify the version using magic numbers, which we discuss in more detail in “Magic Numbers”.

校验和

Checksumming

磁盘上的文件可能会因软件错误和硬件故障而损坏。为了预先识别这些问题并避免将损坏的数据传播到其他子系统甚至节点,我们可以使用校验和和循环冗余校验(CRC)。

Files on disk may get damaged or corrupted by software bugs and hardware failures. To identify these problems preemptively and avoid propagating corrupt data to other subsystems or even nodes, we can use checksums and cyclic redundancy checks (CRCs).

一些资料来源没有区分加密和非加密哈希函数、CRC 和校验和。它们的共同点是都将一大块数据归约为一个小数字,但它们的用例、目的和保证是不同的。

Some sources make no distinction between cryptographic and noncryptographic hash functions, CRCs, and checksums. What they all have in common is that they reduce a large chunk of data to a small number, but their use cases, purposes, and guarantees are different.

校验和提供最弱的保证形式,并且无法检测多个位的损坏。它们通常通过XOR奇偶校验或求和[KOOPMAN15]来计算。

Checksums provide the weakest form of guarantee and aren’t able to detect corruption in multiple bits. They’re usually computed by using XOR with parity checks or summation [KOOPMAN15].

CRC 可以帮助检测突发错误(例如,当多个连续位被损坏时),并且它们的实现通常使用查找表和多项式除法[STONE98]。多位错误对于检测至关重要,因为通信网络和存储设备中很大一部分故障都是以这种方式表现出来的。

CRCs can help detect burst errors (e.g., when multiple consecutive bits got corrupted) and their implementations usually use lookup tables and polynomial division [STONE98]. Multibit errors are crucial to detect, since a significant percentage of failures in communication networks and storage devices manifest this way.

警告

非加密哈希值和 CRC 不应用于验证数据是否已被篡改。为此,您应该始终使用为安全性而设计的强加密哈希。CRC 的主要目标是确保数据不会发生意外和意外的更改。这些算法并不是为了抵御攻击和故意更改数据而设计的。

Noncryptographic hashes and CRCs should not be used to verify whether or not the data has been tampered with. For this, you should always use strong cryptographic hashes designed for security. The main goal of CRC is to make sure that there were no unintended and accidental changes in data. These algorithms are not designed to resist attacks and intentional changes in data.

在将数据写入磁盘之前,我们计算其校验和并将其与数据一起写入。当读回时,我们再次计算校验和并将其与写入的进行比较。如果存在校验和不匹配,我们就知道发生了损坏,并且我们不应该使用读取的数据。

Before writing the data on disk, we compute its checksum and write it together with the data. When reading it back, we compute the checksum again and compare it with the written one. If there’s a checksum mismatch, we know that corruption has occurred and we should not use the data that was read.

由于计算整个文件的校验和通常是不切实际的,并且我们不太可能每次访问它时都读取整个内容,因此页面校验和通常在页面上计算并放置在页眉中。这样,校验和可以更加稳健(因为它们是在数据的一个小子集上执行的),并且如果损坏包含在单个页面中,则不必丢弃整个文件。

Since computing a checksum over the whole file is often impractical and it is unlikely we’re going to read the entire content every time we access it, page checksums are usually computed on pages and placed in the page header. This way, checksums can be more robust (since they are performed on a small subset of the data), and the whole file doesn’t have to be discarded if corruption is contained in a single page.
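A sketch of per-page checksumming, here using CRC32 via Python's zlib.crc32 (real systems choose their own algorithm and header layout; storing the checksum in the first four bytes is an assumption for illustration):

```python
import zlib

CHECKSUM_SIZE = 4

def write_page(payload):
    """Prefix the page payload with a CRC32 of its contents (a simplified page header)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return crc.to_bytes(CHECKSUM_SIZE, "little") + payload

def read_page(page):
    """Recompute the checksum and compare it with the stored one before using the data."""
    stored = int.from_bytes(page[:CHECKSUM_SIZE], "little")
    payload = page[CHECKSUM_SIZE:]
    if zlib.crc32(payload) & 0xFFFFFFFF != stored:
        raise ValueError("page checksum mismatch: corruption detected")
    return payload

page = write_page(b"some cell data")
assert read_page(page) == b"some cell data"

# Flipping bits in the payload makes the mismatch detectable on read.
corrupted = page[:-1] + bytes([page[-1] ^ 0xFF])
try:
    read_page(corrupted)
    assert False, "corruption should have been detected"
except ValueError:
    pass
```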

概括

Summary

在本章中,我们学习了二进制数据组织:如何序列化原始数据类型,将它们组合成单元格,从单元格中构建分槽页面,以及导航这些结构。

In this chapter, we learned about binary data organization: how to serialize primitive data types, combine them into cells, build slotted pages out of cells, and navigate these structures.

我们学习了如何处理可变大小的数据类型,例如字符串、字节序列和数组,以及如何组成保存其中包含的值大小的特殊单元格。

We learned how to handle variable-size data types such as strings, byte sequences, and arrays, and compose special cells that hold a size of values contained in them.

我们讨论了分槽页面格式,它允许我们通过单元 ID 从页面外部引用各个单元,按插入顺序存储记录,并通过对单元偏移进行排序来保留键顺序。

We discussed the slotted page format, which allows us to reference individual cells from outside the page by cell ID, store records in the insertion order, and preserve the key order by sorting cell offsets.

这些原理可用于构建磁盘结构和网络协议的二进制格式。

These principles can be used to compose binary formats for on-disk structures and network protocols.

1 Depending on the platform (macOS, Solaris, Aix, or one of the BSD flavors, or Windows), the kLittleEndian variable is set to whether or not the platform supports little-endian.

2 It’s worth noting that compilers can add padding to structures, which is also architecture dependent. This may break the assumptions about the exact byte offsets and locations. You can read more about structure packing here: https://databass.dev/links/58.

Chapter 4. Implementing B-Trees

In the previous chapter, we talked about general principles of binary format composition, and learned how to create cells, build hierarchies, and connect them to pages using pointers. These concepts are applicable for both in-place update and append-only storage structures. In this chapter, we discuss some concepts specific to B-Trees.

The sections in this chapter are split into three logical groups. First, we discuss organization: how to establish relationships between keys and pointers, and how to implement headers and links between pages.

Next, we discuss processes that occur during root-to-leaf descents, namely how to perform binary search and how to collect breadcrumbs and keep track of parent nodes in case we later have to split or merge nodes.

Lastly, we discuss optimization techniques (rebalancing, right-only appends, and bulk loading), maintenance processes, and garbage collection.

Page Header

The page header holds information about the page that can be used for navigation, maintenance, and optimizations. It usually contains flags that describe page contents and layout, number of cells in the page, lower and upper offsets marking the empty space (used to append cell offsets and data), and other useful metadata.

For example, PostgreSQL stores the page size and layout version in the header. In MySQL InnoDB, the page header holds the number of heap records, the level, and some other implementation-specific values. In SQLite, the page header stores the number of cells and a rightmost pointer.

Magic Numbers

One of the values often placed in the file or page header is a magic number. Usually, it’s a multibyte block, containing a constant value that can be used to signal that the block represents a page, specify its kind, or identify its version.

Magic numbers are often used for validation and sanity checks [GIAMPAOLO98]. It’s very improbable that the byte sequence at a random offset would exactly match the magic number. If it did match, there’s a good chance the offset is correct. For example, to verify that the page is loaded and aligned correctly, during write we can place the magic number 50 41 47 45 (hex for PAGE) into the header. During the read, we validate the page by comparing the four bytes from the read header with the expected byte sequence.
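
A sketch of such a sanity check, assuming an illustrative 8-byte header layout (the magic number followed by two hypothetical fields, a version and a cell count):

```python
import struct

MAGIC = b"PAGE"  # 50 41 47 45 in hex

def make_header(version: int, cell_count: int) -> bytes:
    """Pack the magic number first, followed by two illustrative header fields."""
    return struct.pack(">4sHH", MAGIC, version, cell_count)

def parse_header(raw: bytes):
    """Validate the magic number before trusting the rest of the header."""
    magic, version, cell_count = struct.unpack(">4sHH", raw[:8])
    if magic != MAGIC:
        raise ValueError("not a valid page: magic number mismatch")
    return version, cell_count
```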

Rightmost Pointers

B-Tree separator keys have strict invariants: they’re used to split the tree into subtrees and navigate them, so there is always one more pointer to child pages than there are keys. That’s where the +1 mentioned in “Counting Keys” comes from.

“分隔符键”中,我们描述了分隔符键不变量。在许多实现中,节点看起来更像图 4-2中显示的节点:每个分隔符键都有一个子指针,而最后一个指针是单独存储的,因为它不与任何键配对。您可以将此与图 2-10进行比较。

In “Separator Keys”, we described separator key invariants. In many implementations, nodes look more like the ones displayed in Figure 4-2: each separator key has a child pointer, while the last pointer is stored separately, since it’s not paired with any key. You can compare this to Figure 2-10.

Figure 4-2. Rightmost pointer

This extra pointer can be stored in the header, as it is implemented in SQLite, for example.

If the rightmost child is split and the new cell is appended to its parent, the rightmost child pointer has to be reassigned. As shown in Figure 4-3, after the split, the cell appended to the parent (shown in gray) holds the promoted key and points to the split node. The pointer to the new node is assigned instead of the previous rightmost pointer. A similar approach is described and implemented in SQLite.1

Figure 4-3. Rightmost pointer update during node split. The promoted key is shown in gray.

Node High Keys

We can take a slightly different approach and store the rightmost pointer in the cell along with the node high key. The high key represents the highest possible key that can be present in the subtree under the current node. This approach is used by PostgreSQL and is called Blink-Trees (for concurrency implications of this approach, see “Blink-Trees”).

B-Trees have N keys (denoted with Ki) and N + 1 pointers (denoted with Pi). In each subtree, keys are bounded by Ki-1 ≤ Ks < Ki. K0 = -∞ is implicit and is not present in the node.
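
Under this invariant, picking the child pointer to follow reduces to counting how many separator keys are less than or equal to the searched key. A sketch using the standard `bisect` module:

```python
import bisect

def child_index(separator_keys, search_key):
    """Return the index of the child pointer to follow during a lookup.

    For N sorted separator keys there are N + 1 children; child i holds
    keys k satisfying keys[i-1] <= k < keys[i], so a key equal to a
    separator descends into the subtree to its right."""
    return bisect.bisect_right(separator_keys, search_key)
```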

Blink-Trees add a KN+1 key to each node. It specifies an upper bound of keys that can be stored in the subtree to which the pointer PN points, and therefore is an upper bound of values that can be stored in the current subtree. Both approaches are shown in Figure 4-4: (a) shows a node without a high key, and (b) shows a node with a high key.

Figure 4-4. B-Trees without (a) and with (b) a high key

In this case, pointers can be stored pairwise, and each cell can have a corresponding pointer, which might simplify rightmost pointer handling as there are not as many edge cases to consider.

图 4-5中,您可以看到两种方法的示意性页面结构,以及在这些情况下如何以不同方式分割搜索空间:在第一种情况下向上到,在第二种情况下 +∞向上到上限。K3

In Figure 4-5, you can see schematic page structure for both approaches and how the search space is split differently for these cases: going up to +∞ in the first case, and up to the upper bound of K3 in the second.

Figure 4-5. Using +∞ as a virtual key (a) versus storing the high key (b)

Overflow Pages

Node size and tree fanout values are fixed and do not change dynamically. It would also be difficult to come up with a value that would be universally optimal: if variable-size values are present in the tree and they are large enough, only a few of them can fit into the page. If the values are tiny, we end up wasting the reserved space.

The B-Tree algorithm specifies that every node keeps a specific number of items. Since some values have different sizes, we may end up in a situation where, according to the B-Tree algorithm, the node is not full yet, but there’s no more free space on the fixed-size page that holds this node. Resizing the page requires copying already written data to the new region and is often impractical. However, we still need to find a way to increase or extend the page size.

To implement variable-size nodes without copying data to the new contiguous region, we can build nodes from multiple linked pages. For example, the default page size is 4 K, and after inserting a few values, its data size has grown over 4 K. Instead of allowing arbitrary sizes, nodes are allowed to grow in 4 K increments, so we allocate a 4 K extension page and link it from the original one. These linked page extensions are called overflow pages. For clarity, we call the original page the primary page in the scope of this section.

Most B-Tree implementations allow storing only up to a fixed number of payload bytes in the B-Tree node directly and spilling the rest to the overflow page. This value is calculated by dividing the node size by fanout. Using this approach, we cannot end up in a situation where the page has no free space, as it will always have at least max_payload_size bytes. For more information on overflow pages in SQLite, see the SQLite source code repository; also check out the MySQL InnoDB documentation.

When the inserted payload is larger than max_payload_size, the node is checked for whether or not it already has any associated overflow pages. If an overflow page already exists and has enough space available, extra bytes from the payload are spilled there. Otherwise, a new overflow page is allocated.
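
A sketch of the spilling decision; the 4 K page size and fanout of 16 are illustrative values, not taken from any particular database:

```python
PAGE_SIZE = 4096   # illustrative node/page size
FANOUT = 16        # illustrative tree fanout
MAX_PAYLOAD_SIZE = PAGE_SIZE // FANOUT  # 256 bytes kept inline per entry

def split_payload(payload: bytes):
    """Keep at most MAX_PAYLOAD_SIZE bytes in the primary page; the
    remainder spills to an overflow page (None when everything fits)."""
    inline = payload[:MAX_PAYLOAD_SIZE]
    overflow = payload[MAX_PAYLOAD_SIZE:] or None
    return inline, overflow
```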

图 4-6中,您可以看到一个主页面和一个溢出页面,其中的记录从主页面指向溢出页面,其负载继续在溢出页面上。

In Figure 4-6, you can see a primary page and an overflow page with records pointing from the primary page to the overflow one, where their payload continues.

Figure 4-6. Overflow page

Overflow pages require some extra bookkeeping, since they may get fragmented as well as primary pages, and we have to be able to reclaim this space to write new data, or discard the overflow page if it’s not needed anymore.

When the first overflow page is allocated, its page ID is stored in the header of the primary page. If a single overflow page is not enough, multiple overflow pages are linked together by storing the next overflow page ID in the previous one’s header. Several pages may have to be traversed to locate the overflow part for the given payload.

Since keys usually have high cardinality, storing a portion of a key makes sense, as most of the comparisons can be made on the trimmed key part that resides in the primary page.

For data records, we have to locate their overflow parts to return them to the user. However, this cost does not matter much, since it is an infrequent operation. If all data records are oversized, it is worth considering specialized blob storage for large values.

Binary Search

We’ve already discussed the B-Tree lookup algorithm (see “B-Tree Lookup Algorithm”) and mentioned that we locate a searched key within the node using the binary search algorithm. Binary search works only for sorted data. If keys are not ordered, they can’t be binary searched. This is why keeping keys in order and maintaining a sorted invariant is essential.

The binary search algorithm receives an array of sorted items and a searched key, and returns a number. If the returned number is non-negative, we know that the searched key was found and the number specifies its position in the input array. A negative return value indicates that the searched key is not present in the input array and gives us an insertion point.

The insertion point is the index of the first element that is greater than the given key. An absolute value of this number is the index at which the searched key can be inserted to preserve order. Insertion can be done by shifting elements over one position, starting from an insertion point, to make space for the inserted element [SEDGEWICK11].
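
A sketch of this contract, using the common encoding (as in Java's `Arrays.binarySearch`) where a miss returns `-(insertion_point + 1)`, so that a key found at index 0 stays distinguishable from "insert at 0":

```python
def binary_search(items, key):
    """Return the index of key if found; otherwise a negative number
    encoding the insertion point as -(insertion_point + 1)."""
    low, high = 0, len(items) - 1
    while low <= high:
        mid = (low + high) // 2
        if items[mid] < key:
            low = mid + 1
        elif items[mid] > key:
            high = mid - 1
        else:
            return mid
    return -(low + 1)  # low is the index of the first element > key
```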

The majority of searches on higher levels do not result in exact matches, and we’re interested in the search direction, in which case we have to find the first value that is greater than the searched one and follow the corresponding child link into the associated subtree.

Binary Search with Indirection Pointers

Cells in the B-Tree page are stored in the insertion order, and only cell offsets preserve the logical element order. To perform binary search through page cells, we pick the middle cell offset, follow its pointer to locate the cell, compare the key from this cell with the searched key to decide whether the search should continue left or right, and continue this process recursively until the searched element or the insertion point is found, as shown in Figure 4-7.
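
A sketch of this search, modeling the page abstractly as a mapping from cell offsets to keys plus an offset array sorted by key (a deliberate simplification of the binary page layout):

```python
def search_slotted_page(cells, offsets, key):
    """Binary-search the sorted offset array, dereferencing each offset
    to read the actual key from its cell. cells maps offset -> key;
    offsets lists cell offsets in key order, not insertion order.
    Returns the matching cell offset, or None if the key is absent."""
    low, high = 0, len(offsets) - 1
    while low <= high:
        mid = (low + high) // 2
        cell_key = cells[offsets[mid]]  # follow the indirection pointer
        if cell_key < key:
            low = mid + 1
        elif cell_key > key:
            high = mid - 1
        else:
            return offsets[mid]
    return None
```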

Propagating Splits and Merges

As we’ve discussed in previous chapters, B-Tree splits and merges can propagate to higher levels. For that, we need to be able to traverse a chain back to the root node from the splitting leaf or a pair of merging leaves.

B-Tree nodes may include parent node pointers. Since pages from lower levels are always paged in when they’re referenced from a higher level, it is not even necessary to persist this information on disk.

Just like sibling pointers (see “Sibling Links”), parent pointers have to be updated whenever the parent changes. This happens in all the cases when the separator key with the page identifier is transferred from one node to another: during the parent node splits, merges, or rebalancing of the parent node.

Some implementations (for example, WiredTiger) use parent pointers for leaf traversal to avoid deadlocks, which may happen when using sibling pointers (see [MILLER78], [LEHMAN81]). Instead of using sibling pointers to traverse leaf nodes, the algorithm employs parent pointers, much like we saw in Figure 4-1.

To address and locate a sibling, we can follow a pointer from the parent node and recursively descend back to the lower level. Whenever we reach the end of the parent node after traversing all the siblings sharing the parent, the search continues upward recursively, eventually reaching up to the root and continuing back down to the leaf level.

Breadcrumbs

Instead of storing and maintaining parent node pointers, it is possible to keep track of nodes traversed on the path to the target leaf node, and follow the chain of parent nodes in reverse order in case of cascading splits during inserts, or merges during deletes.

During operations that may result in structural changes of the B-Tree (insert or delete), we first traverse the tree from the root to the leaf to find the target node and the insertion point. Since we do not always know up front whether or not the operation will result in a split or merge (at least not until the target leaf node is located), we have to collect breadcrumbs.

Breadcrumbs contain references to the nodes followed from the root and are used to backtrack them in reverse when propagating splits or merges. The most natural data structure for this is a stack. For example, PostgreSQL stores breadcrumbs in a stack, internally referenced as BTStack.2

If the node is split or merged, breadcrumbs can be used to find insertion points for the keys pulled to the parent and to walk back up the tree to propagate structural changes to the higher-level nodes, if necessary. This stack is maintained in memory.

Figure 4-8 shows an example of root-to-leaf traversal, collecting breadcrumbs containing pointers to the visited nodes and cell indices. If the target leaf node is split, the item on top of the stack is popped to locate its immediate parent. If the parent node has enough space, a new cell is appended to it at the cell index from the breadcrumb (assuming the index is still valid). Otherwise, the parent node is split as well. This process continues recursively until either the stack is empty and we have reached the root, or there was no split on the level.
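
A sketch of collecting breadcrumbs during the descent, with nodes modeled as plain dictionaries (an illustrative representation, not an actual page format):

```python
def descend_with_breadcrumbs(root, key):
    """Walk from the root to the target leaf, pushing (node, child_index)
    breadcrumbs onto a stack so that splits or merges can later be
    propagated back up by popping the stack in reverse order."""
    breadcrumbs = []
    node = root
    while not node["leaf"]:
        # index of the child pointer to follow (one more pointer than keys)
        idx = sum(1 for k in node["keys"] if k <= key)
        breadcrumbs.append((node, idx))
        node = node["children"][idx]
    return node, breadcrumbs
```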

Figure 4-8. Breadcrumbs collected during lookup, containing traversed nodes and cell indices. Dotted lines represent logical links to the visited nodes. Numbers in the breadcrumbs table represent indices of the followed child pointers.

Rebalancing

Some B-Tree implementations attempt to postpone split and merge operations to amortize their costs by rebalancing elements within the level, or moving elements from more occupied nodes to less occupied ones for as long as possible before finally performing a split or merge. This helps to improve node occupancy and may reduce the number of levels within the tree at a potentially higher maintenance cost of rebalancing.

Load balancing can be performed during insert and delete operations [GRAEFE11]. To improve space utilization, instead of splitting the node on overflow, we can transfer some of the elements to one of the sibling nodes and make space for the insertion. Similarly, during delete, instead of merging the sibling nodes, we may choose to move some of the elements from the neighboring nodes to ensure the node is at least half full.
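
A sketch of redistributing keys between two sibling leaves; after the move, the new separator key to install in the parent is the smallest key of the right sibling:

```python
def rebalance_siblings(left, right):
    """Redistribute keys evenly between two sibling leaves (given as
    sorted lists). Returns the new left node, the new right node, and
    the separator key to update in the parent."""
    merged = left + right
    mid = len(merged) // 2
    return merged[:mid], merged[mid:], merged[mid]
```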

B*-Trees keep distributing data between the neighboring nodes until both siblings are full [KNUTH98]. Then, instead of splitting a single node into two half-empty ones, the algorithm splits two nodes into three nodes, each of which is two-thirds full. SQLite uses this variant in the implementation. This approach improves an average occupancy by postponing splits, but requires additional tracking and balancing logic. Higher utilization also means more efficient searches, because the height of the tree is smaller and fewer pages have to be traversed on the path to the searched leaf.

Figure 4-9 shows distributing elements between the neighboring nodes, where the left sibling contains more elements than the right one. Elements from the more occupied node are moved to the less occupied one. Since balancing changes the min/max invariant of the sibling nodes, we have to update keys and pointers at the parent node to preserve it.

Figure 4-9. B-Tree balancing: distributing elements between the more occupied and less occupied nodes

Load balancing is a useful technique used in many database implementations. For example, SQLite implements the balance-siblings algorithm, which is somewhat close to what we have described in this section. Balancing might add some complexity to the code, but since its use cases are isolated, it can be implemented as an optimization at a later stage.

Right-Only Appends

Many database systems use auto-incremented monotonically increasing values as primary index keys. This case opens up an opportunity for an optimization, since all the insertions are happening toward the end of the index (in the rightmost leaf), so most of the splits occur on the rightmost node on each level. Moreover, since the keys are monotonically incremented, given that the ratio of appends versus updates and deletes is low, nonleaf pages are also less fragmented than in the case of randomly ordered keys.

PostgreSQL calls this case a fastpath. When the inserted key is strictly greater than the first key in the rightmost page, and the rightmost page has enough space to hold the newly inserted entry, the new entry is inserted into the appropriate location in the cached rightmost leaf, and the whole read path can be skipped.
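
A sketch of this check under simplified assumptions (the node representation and capacity test are illustrative, not PostgreSQL's actual code):

```python
import bisect

def fastpath_insert(rightmost_leaf, key, capacity):
    """Try the cached rightmost leaf first: if the key is strictly
    greater than the leaf's first key and the leaf has room, insert in
    place and skip the full root-to-leaf descent. Returns True when the
    fastpath applied; False means fall back to the regular insert path."""
    keys = rightmost_leaf["keys"]
    if keys and key > keys[0] and len(keys) < capacity:
        bisect.insort(keys, key)  # insert at the appropriate location
        return True
    return False
```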

SQLite has a similar concept and calls it quickbalance. When the entry is inserted on the far right end and the target node is full (i.e., it becomes the largest entry in the tree upon insertion), instead of rebalancing or splitting the node, it allocates the new rightmost node and adds its pointer to the parent (for more on implementing balancing in SQLite, see “Rebalancing”). Even though this leaves the newly created page nearly empty (instead of half empty in the case of a node split), it is very likely that the node will get filled up shortly.

Bulk Loading

If we have presorted data and want to bulk load it, or have to rebuild the tree (for example, for defragmentation), we can take the idea with right-only appends even further. Since the data required for tree creation is already sorted, during bulk loading we only need to append the items at the rightmost location in the tree.

In this case, we can avoid splits and merges altogether and compose the tree from the bottom up, writing it out level by level, or writing out higher-level nodes as soon as we have enough pointers to already written lower-level nodes.

One approach for implementing bulk loading is to write presorted data on the leaf level page-wise (rather than inserting individual elements). After the leaf page is written, we propagate its first key to the parent and use a normal algorithm for building higher B-Tree levels [RAMAKRISHNAN03]. Since appended keys are given in the sorted order, all splits in this case occur on the rightmost node.

Since B-Trees are always built starting from the bottom (leaf) level, the complete leaf level can be written out before any higher-level nodes are composed. This allows having all child pointers at hand by the time the higher levels are constructed. The main benefits of this approach are that we do not have to perform any splits or merges on disk and, at the same time, have to keep only a minimal part of the tree (i.e., all parents of the currently filling leaf node) in memory for the time of construction.
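
A sketch of this bottom-up construction, where each internal node simply stores the first key of every child group — a deliberate simplification of real separator keys plus child pointers:

```python
def bulk_load(sorted_keys, fanout):
    """Build B-Tree levels bottom-up from presorted keys.

    Returns a list of levels: levels[0] is the leaf level, levels[-1]
    holds the single root node. Each internal node consists of the
    first key of each child group it points to."""
    level = [sorted_keys[i:i + fanout]
             for i in range(0, len(sorted_keys), fanout)]
    levels = [level]
    while len(level) > 1:
        firsts = [node[0] for node in level]  # promote each node's first key
        level = [firsts[i:i + fanout] for i in range(0, len(firsts), fanout)]
        levels.append(level)
    return levels
```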

Immutable B-Trees can be created in the same manner but, unlike mutable B-Trees, they require no space overhead for subsequent modifications, since all operations on a tree are final. All pages can be completely filled up, improving occupancy and resulting in better performance.

Compression

Storing the raw, uncompressed data can induce significant overhead, and many databases offer ways to compress it to save space. The apparent trade-off here is between access speed and compression ratio: higher compression ratios reduce the data size, allowing you to fetch more data in a single access, but might require more RAM and CPU cycles to compress and decompress it.

Compression can be done at different granularity levels. Even though compressing entire files can yield better compression ratios, it has limited application as a whole file has to be recompressed on an update, and more granular compression is usually better-suited for larger datasets. Compressing an entire index file is both impractical and hard to implement efficiently: to address a particular page, the whole file (or its section containing compression metadata) has to be accessed (in order to locate a compressed section), decompressed, and made available.

An alternative is to compress data page-wise. It fits our discussion well, since the algorithms we’ve been discussing so far use fixed-size pages. Pages can be compressed and uncompressed independently from one another, allowing you to couple compression with page loading and flushing. However, a compressed page in this case can occupy only a fraction of a disk block and, since transfers are usually done in units of disk blocks, it might be necessary to page in extra bytes [RAY95]. In Figure 4-10, you can see a compressed page (a) taking less space than the disk block. When we load this page, we also page in additional bytes that belong to the other page. With pages that span multiple disk blocks, like (b) in the same image, we have to read an additional block.
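
A sketch of page-wise compression using Python's standard `zlib`, including the whole-block rounding that governs how many disk blocks a compressed page transfer touches (the block size is illustrative):

```python
import zlib

BLOCK_SIZE = 512  # illustrative disk block size

def compress_page(page: bytes) -> bytes:
    """Compress a single page independently of its neighbors."""
    return zlib.compress(page)

def blocks_needed(compressed: bytes) -> int:
    """Transfers happen in units of whole disk blocks, so round up."""
    return -(-len(compressed) // BLOCK_SIZE)
```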

Figure 4-10. Compression and block padding

Another approach is to compress data only, either row-wise (compressing entire data records) or column-wise (compressing columns individually). In this case, page management and compression are decoupled.

Most of the open source databases reviewed while writing this book have pluggable compression methods, using available libraries such as Snappy, zLib, lz4, and many others.

As compression algorithms yield different results depending on a dataset and potential objectives (e.g., compression ratio, performance, or memory overhead), we will not go into comparison and implementation details in this book. There are many overviews available that evaluate different compression algorithms for different block sizes (for example, Squash Compression Benchmark), usually focusing on four metrics: memory overhead, compression performance, decompression performance, and compression ratio. These metrics are important to consider when picking a compression library.

Vacuum and Maintenance

So far we’ve been mostly talking about user-facing operations in B-Trees. However, there are other processes that happen in parallel with queries that maintain storage integrity, reclaim space, reduce overhead, and keep pages in order. Performing these operations in the background allows us to save some time and avoid paying the price of cleanup during inserts, updates, and deletes.

The described design of slotted pages (see “Slotted Pages”) requires maintenance to be performed on pages to keep them in good shape. For example, subsequent splits and merges in internal nodes or inserts, updates, and deletes on the leaf level can result in a page that has enough logical space but does not have enough contiguous space, since it is fragmented. Figure 4-11 shows an example of such a situation: the page still has some logical space available, but it’s fragmented and is split between the two deleted (garbage) records and some remaining free space between the header/cell pointers and cells.

Figure 4-11. Example of a fragmented page

B-Trees are navigated from the root level. Data records that can be reached by following pointers down from the root node are live (addressable). Nonaddressable data records are said to be garbage: these records are not referenced anywhere and cannot be read or interpreted, so their contents are as good as nullified.

You can see this distinction in Figure 4-11: cells that still have pointers to them are addressable, unlike the removed or overwritten ones. Zero-filling of garbage areas is often skipped for performance reasons, as eventually these areas are overwritten by the new data anyway.

Fragmentation Caused by Updates and Deletes

Let’s consider under which circumstances pages get into the state where they have nonaddressable data and have to be compacted. On the leaf level, deletes only remove cell offsets from the header, leaving the cell itself intact. After this is done, the cell is not addressable anymore, its contents will not appear in the query results, and nullifying it or moving neighboring cells is not necessary.

When the page is split, only offsets are trimmed and, since the rest of the page is not addressable, cells whose offsets were truncated are not reachable, so they will be overwritten whenever the new data arrives, or garbage-collected when the vacuum process kicks in.

NOTE

Some databases rely on garbage collection, and leave removed and updated cells in place for multiversion concurrency control (see “Multiversion Concurrency Control”). Cells remain accessible for the concurrently executing transactions until the update is complete, and can be collected as soon as no other thread accesses them. Some databases maintain structures that track ghost records, which are collected as soon as all transactions that may have seen them complete [WEIKUM01].

Since deletes only discard cell offsets and do not relocate remaining cells or physically remove the target cells to occupy the freed space, freed bytes might end up scattered across the page. In this case, we say that the page is fragmented and requires defragmentation.

To make a write, we often need a contiguous block of free bytes where the cell fits. To put the freed fragments back together and fix this situation, we have to rewrite the page.
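As a rough illustration, here is a minimal slotted-page sketch in Python (a hypothetical layout, not any particular database's page format): a delete only drops the cell's slot from the header, leaving the cell bytes behind as unaddressable garbage, and a page rewrite (compaction) puts the freed fragments back together.

```python
class SlottedPage:
    """Hypothetical slotted page: slots grow from the front, cells from the back."""

    def __init__(self, size=64):
        self.data = bytearray(size)
        self.slots = []            # header: (offset, length) per live cell
        self.free_start = size     # cells are appended downward from the page end

    def insert(self, cell: bytes):
        self.free_start -= len(cell)
        self.data[self.free_start:self.free_start + len(cell)] = cell
        self.slots.append((self.free_start, len(cell)))

    def delete(self, idx: int):
        # Only the slot is removed; the cell bytes stay in place as garbage
        # (no zero-filling, no relocation of neighboring cells).
        del self.slots[idx]

    def cells(self):
        return [bytes(self.data[o:o + n]) for o, n in self.slots]

    def compact(self):
        # Rewrite live cells back-to-back, reclaiming fragmented free space.
        live = self.cells()
        self.data = bytearray(len(self.data))
        self.slots, self.free_start = [], len(self.data)
        for cell in live:
            self.insert(cell)
```

After a delete, `free_start` does not move: the freed bytes sit in the middle of the cell area and only become usable again once `compact()` rewrites the page.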

Insert operations leave tuples in their insertion order. This does not have as significant an impact, but having naturally sorted tuples can help with cache prefetch during sequential reads.

Updates are mostly applicable to the leaf level: internal page keys are used for guided navigation and only define subtree boundaries. Additionally, updates are performed on a per-key basis, and generally do not result in structural changes in the tree, apart from the creation of overflow pages. On the leaf level, however, update operations do not change cell order and attempt to avoid page rewrite. This means that multiple versions of the cell, only one of which is addressable, may end up being stored.

Page Defragmentation

The process that takes care of space reclamation and page rewrites is called compaction, vacuum, or just maintenance. Page rewrites can be done synchronously on write if the page does not have enough free physical space (to avoid creating unnecessary overflow pages), but compaction is mostly referred to as a distinct, asynchronous process of walking through pages, performing garbage collection, and rewriting their contents.

This process reclaims the space occupied by dead cells, and rewrites cells in their logical order. When pages are rewritten, they may also get relocated to new positions in the file. Unused in-memory pages become available and are returned to the page cache. IDs of the newly available on-disk pages are added to the free page list (sometimes called a freelist3). This information has to be persisted to survive node crashes and restarts, and to make sure free space is not lost or leaked.

Summary

In this chapter, we discussed the concepts specific to on-disk B-Tree implementations, such as:

Page header

What information is usually stored there.

Rightmost pointers

These are not paired with separator keys, and how to handle them.

High keys

Determine the maximum allowed key that can be stored in the node.

Overflow pages

Allow you to store oversize and variable-size records using fixed-size pages.

After that, we went through some details related to root-to-leaf traversals:

  • How to perform binary search with indirection pointers

  • How to keep track of tree hierarchies using parent pointers or breadcrumbs

Lastly, we went through some optimization and maintenance techniques:

Rebalancing

Moves elements between neighboring nodes to reduce a number of splits and merges.

Right-only appends

Appends the new rightmost cell instead of splitting it under the assumption that it will quickly fill up.

Bulk loading

A technique for efficiently building B-Trees from scratch from sorted data.

Garbage collection

A process that rewrites pages, puts cells in key order, and reclaims space occupied by unaddressable cells.

These concepts should bridge the gap between the basic B-Tree algorithm and a real-world implementation, and help you better understand how B-Tree–based storage systems work.

1 You can find this algorithm in the balance_deeper function in the project repository.

2 You can read more about it in the project repository: https://databass.dev/links/21.

3 For example, SQLite maintains a list of pages that are not used by the database, where trunk pages are held in a linked list and hold addresses of freed pages.

Chapter 5. Transaction Processing and Recovery

In this book, we’ve taken a bottom-up approach to database system concepts: we first learned about storage structures. Now, we’re ready to move to the higher-level components responsible for buffer management, lock management, and recovery, which are the prerequisites for understanding database transactions.

A transaction is an indivisible logical unit of work in a database management system, allowing you to represent multiple operations as a single step. Operations executed by transactions include reading and writing database records. A database transaction has to preserve atomicity, consistency, isolation, and durability. These properties are commonly referred to as ACID [HAERDER83]:

Atomicity

Transaction steps are indivisible, which means that either all the steps associated with the transaction execute successfully or none of them do. In other words, transactions should not be applied partially. Each transaction can either commit (make all changes from write operations executed during the transaction visible), or abort (roll back all transaction side effects that haven’t yet been made visible). Commit is a final operation. After an abort, the transaction can be retried.

Consistency

Consistency is an application-specific guarantee; a transaction should only bring the database from one valid state to another valid state, maintaining all database invariants (such as constraints, referential integrity, and others). Consistency is the most weakly defined property, possibly because it is the only property that is controlled by the user and not only by the database itself.

Isolation

Multiple concurrently executing transactions should be able to run without interference, as if there were no other transactions executing at the same time. Isolation defines when the changes to the database state may become visible, and what changes may become visible to the concurrent transactions. Many databases use isolation levels that are weaker than the given definition of isolation for performance reasons. Depending on the methods and approaches used for concurrency control, changes made by a transaction may or may not be visible to other concurrent transactions (see “Isolation Levels”).

Durability

Once a transaction has been committed, all database state modifications have to be persisted on disk and be able to survive power outages, system failures, and crashes.

Implementing transactions in a database system, in addition to a storage structure that organizes and persists data on disk, requires several components to work together. On the node locally, the transaction manager coordinates, schedules, and tracks transactions and their individual steps.

The lock manager guards access to these resources and prevents concurrent accesses that would violate data integrity. Whenever a lock is requested, the lock manager checks if it is already held by any other transaction in shared or exclusive mode, and grants access to it if the requested access level results in no contradiction. Since exclusive locks can be held by at most one transaction at any given moment, other transactions requesting them have to wait until locks are released, or abort and retry later. As soon as the lock is released or whenever the transaction terminates, the lock manager notifies one of the pending transactions, letting it acquire the lock and continue.

The page cache serves as an intermediary between persistent storage (disk) and the rest of the storage engine. It stages state changes in main memory and serves as a cache for the pages that haven’t been synchronized with persistent storage. All changes to a database state are first applied to the cached pages.

The log manager holds a history of operations (log entries) applied to cached pages but not yet synchronized with persistent storage to guarantee they won’t be lost in case of a crash. In other words, the log is used to reapply these operations and reconstruct the cached state during startup. Log entries can also be used to undo changes done by the aborted transactions.

Distributed (multipartition) transactions require additional coordination and remote execution. We discuss distributed transaction protocols in Chapter 13.

Buffer Management

Most databases are built using a two-level memory hierarchy: slower persistent storage (disk) and faster main memory (RAM). To reduce the number of accesses to persistent storage, pages are cached in memory. When the page is requested again by the storage layer, its cached copy is returned.

Cached pages available in memory can be reused under the assumption that no other process has modified the data on disk. This approach is sometimes referred to as virtual disk [BAYER72]. A virtual disk read accesses physical storage only if no copy of the page is already available in memory. A more common name for the same concept is page cache or buffer pool. The page cache is responsible for caching pages read from disk in memory. In case of a database system crash or disorderly shutdown, cached contents are lost.

Since the term page cache better reflects the purpose of this structure, this book defaults to this name. The term buffer pool sounds like its primary purpose is to pool and reuse empty buffers, without sharing their contents, which can be a useful part of a page cache or even as a separate component, but does not reflect the entire purpose as precisely.

The problem of caching pages is not limited in scope to databases. Operating systems have the concept of a page cache, too. Operating systems utilize unused memory segments to transparently cache disk contents to improve performance of I/O syscalls.

Uncached pages are said to be paged in when they’re loaded from disk. If any changes are made to the cached page, it is said to be dirty, until these changes are flushed back on disk.

Since the memory region where cached pages are held is usually substantially smaller than an entire dataset, the page cache eventually fills up and, in order to page in a new page, one of the cached pages has to be evicted.

图 5-1中,您可以看到 B-Tree 页面的逻辑表示、其缓存版本以及磁盘上的页面之间的关系。页面缓存无序地将页面加载到空闲槽中,因此页面在磁盘和内存中的排序方式之间没有直接映射。

In Figure 5-1, you can see the relation between the logical representation of B-Tree pages, their cached versions, and the pages on disk. The page cache loads pages into free slots out of order, so there’s no direct mapping between how pages are ordered on disk and in memory.

Figure 5-1. Page cache

The primary functions of a page cache can be summarized as:

  • It keeps cached page contents in memory.

  • It allows modifications to on-disk pages to be buffered together and performed against their cached versions.

  • When a requested page isn’t present in memory and there’s enough space available for it, it is paged in by the page cache, and its cached version is returned.

  • If an already cached page is requested, its cached version is returned.

  • If there’s not enough space available for the new page, some other page is evicted and its contents are flushed to disk.

Caching Semantics

All changes made to buffers are kept in memory until they are eventually written back to disk. As no other process is allowed to make changes to the backing file, this synchronization is a one-way process: from memory to disk, and not vice versa. The page cache allows the database to have more control over memory management and disk accesses. You can think of it as an application-specific equivalent of the kernel page cache: it accesses the block device directly, implements similar functionality, and serves a similar purpose. It abstracts disk accesses and decouples logical write operations from the physical ones.

Caching pages helps to keep the tree partially in memory without making additional changes to the algorithm and materializing objects in memory. All we have to do is replace disk accesses by the calls to the page cache.

When the storage engine accesses (in other words, requests) the page, we first check if its contents are already cached, in which case the cached page contents are returned. If the page contents are not yet cached, the cache translates the logical page address or page number to its physical address, loads its contents in memory, and returns its cached version to the storage engine. Once returned, the buffer with cached page contents is said to be referenced, and the storage engine has to hand it back to the page cache or dereference it once it’s done. The page cache can be instructed to avoid evicting pages by pinning them.

If the page is modified (for example, a cell was appended to it), it is marked as dirty. A dirty flag set on the page indicates that its contents are out of sync with the disk and have to be flushed for durability.
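The interactions described so far (page-in on a miss, reference counting, the dirty flag, and explicit flushes) can be sketched as a minimal page cache, with hypothetical names and a plain dict standing in for the on-disk file:

```python
class Page:
    def __init__(self, page_id, data):
        self.page_id = page_id
        self.data = bytearray(data)
        self.ref_count = 0     # referenced pages must not be evicted
        self.dirty = False     # contents out of sync with the on-disk version

class PageCache:
    def __init__(self, disk):
        self.disk = disk       # page_id -> bytes; stand-in for the data file
        self.pages = {}        # cached pages

    def get(self, page_id):
        if page_id not in self.pages:              # cache miss: page in
            self.pages[page_id] = Page(page_id, self.disk[page_id])
        page = self.pages[page_id]
        page.ref_count += 1                        # caller now holds a reference
        return page

    def unref(self, page):
        page.ref_count -= 1                        # caller is done with the page

    def write(self, page, offset, payload):
        page.data[offset:offset + len(payload)] = payload
        page.dirty = True                          # must be flushed for durability

    def flush(self, page):
        self.disk[page.page_id] = bytes(page.data)
        page.dirty = False                         # back in sync with disk
```

Eviction logic is omitted here; a real cache would refuse to evict pages with a nonzero reference count and flush dirty pages first.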

Cache Eviction

Keeping caches populated is good: we can serve more reads without going to persistent storage, and more same-page writes can be buffered together. However, the page cache has a limited capacity and, sooner or later, to serve the new contents, old pages have to be evicted. If page contents are in sync with the disk (i.e., were already flushed or were never modified) and the page is not pinned or referenced, it can be evicted right away. Dirty pages have to be flushed before they can be evicted. Referenced pages should not be evicted while some other thread is using them.

Since triggering a flush on every eviction might be bad for performance, some databases use a separate background process that cycles through the dirty pages that are likely to be evicted, updating their disk versions. For example, PostgreSQL has a background flush writer that does just that.

Another important property to keep in mind is durability: if the database has crashed, all data that was not flushed is lost. To make sure that all changes are persisted, flushes are coordinated by the checkpoint process. The checkpoint process controls the write-ahead log (WAL) and page cache, and ensures that they work in lockstep. Only log records associated with operations applied to cached pages that were flushed can be discarded from the WAL. Dirty pages cannot be evicted until this process completes.

This means there is always a trade-off between several objectives:

  • Postpone flushes to reduce the number of disk accesses

  • Preemptively flush pages to allow quick eviction

  • Pick pages for eviction and flush in the optimal order

  • Keep cache size within its memory bounds

  • Avoid losing the data as it is not persisted to the primary storage

We explore several techniques that help us to improve the first three characteristics while keeping us within the boundaries of the other two.

Locking Pages in Cache

Having to perform disk I/O on each read or write is impractical: subsequent reads may request the same page, just as subsequent writes may modify the same page. Since B-Tree gets “narrower” toward the top, higher-level nodes (ones that are closer to the root) are hit for most of the reads. Splits and merges also eventually propagate to the higher-level nodes. This means there’s always at least a part of a tree that can significantly benefit from being cached.

We can “lock” pages that have a high probability of being used in the nearest time. Locking pages in the cache is called pinning. Pinned pages are kept in memory for a longer time, which helps to reduce the number of disk accesses and improve performance [GRAEFE11].

Since each lower B-Tree node level has exponentially more nodes than the higher one, and higher-level nodes represent just a small fraction of the tree, this part of the tree can reside in memory permanently, and other parts can be paged in on demand. This means that, in order to perform a query, we won’t have to make h disk accesses (as discussed in “B-Tree Lookup Complexity”, h is the height of the tree), but only hit the disk for the lower levels, for which pages are not cached.

Operations performed against a subtree may result in structural changes that contradict each other—for example, multiple delete operations causing merges followed by writes causing splits, or vice versa. Likewise for structural changes that propagate from different subtrees (structural changes occurring close to each other in time, in different parts of the tree, propagating up). These operations can be buffered together by applying changes only in memory, which can reduce the number of disk writes and amortize the operation costs, since only one write can be performed instead of multiple writes.

Page Replacement

When cache capacity is reached, to load new pages, old ones have to be evicted. However, unless we evict pages that are least likely to be accessed again soon, we might end up loading them several times subsequently even though we could’ve just kept them in memory for all that time. We need to find a way to estimate the likelihood of subsequent page access to optimize this.

For this, we can say that pages should be evicted according to the eviction policy (also sometimes called the page-replacement policy). It attempts to find pages that are least likely to be accessed again any time soon. When the page is evicted from the cache, the new page can be loaded in its place.

For a page cache implementation to be performant, it needs an efficient page-replacement algorithm. An ideal page-replacement strategy would require a crystal ball that would predict the order in which pages are going to be accessed and evict only pages that will not be touched for the longest time. Since requests do not necessarily follow any specific pattern or distribution, precisely predicting behavior can be complicated, but using the right page-replacement strategy can help to reduce the number of evictions.

It seems logical that we can reduce the number of evictions by simply using a larger cache. However, this does not appear to be the case. One of the examples demonstrating this dilemma is called Bélády’s anomaly [BEDALY69]. It shows that increasing the number of pages might increase the number of evictions if the used page-replacement algorithm is not optimal. When pages that might be required soon are evicted and then loaded again, pages start competing for space in the cache. Because of that, we need to consider the algorithm we’re using wisely, so that it improves the situation rather than makes it worse.
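The anomaly is easy to reproduce with a small FIFO simulator (FIFO is introduced in the next section). For the classic reference string below, a four-frame cache incurs more page faults than a three-frame one:

```python
from collections import deque

def fifo_faults(references, frames):
    """Count page faults for a FIFO cache with the given number of frames."""
    queue, resident, faults = deque(), set(), 0
    for page in references:
        if page not in resident:
            faults += 1
            if len(resident) == frames:          # cache full: evict the oldest page
                resident.discard(queue.popleft())
            queue.append(page)
            resident.add(page)
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_faults(refs, 3))  # 9 faults
print(fifo_faults(refs, 4))  # 10 faults: more frames, yet more faults
```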

FIFO and LRU

The most naïve page-replacement strategy is first in, first out (FIFO). FIFO maintains a queue of page IDs in their insertion order, adding new pages to the tail of the queue. Whenever the page cache is full, it takes the element from the head of the queue to find the page that was paged in at the farthest point in time. Since it does not account for subsequent page accesses, only for page-in events, this proves to be impractical for most real-world systems. For example, the root and topmost-level pages are paged in first and, according to this algorithm, are the first candidates for eviction, even though it’s clear from the tree structure that these pages are likely to be paged in again soon, if not immediately.

A natural extension of the FIFO algorithm is least-recently used (LRU) [TANENBAUM14]. It also maintains a queue of eviction candidates in insertion order, but allows you to place a page back to the tail of the queue on repeated accesses, as if this was the first time it was paged in. However, updating references and relinking nodes on every access can become expensive in a concurrent environment.
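A minimal LRU sketch using Python’s ordered dictionary (the class and method names here are illustrative, not from any particular database): a repeated access moves the page back to the tail of the queue, so the head always holds the least-recently used eviction victim.

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()               # head = LRU, tail = MRU

    def get(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)      # repeated access: back to the tail
            return self.pages[page_id]
        return None                              # cache miss

    def put(self, page_id, data):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)
        elif len(self.pages) == self.capacity:
            self.pages.popitem(last=False)       # evict from the head (LRU)
        self.pages[page_id] = data
```

The relinking on every `get` is exactly the cost the text mentions: under concurrency, moving a node to the tail on each access requires synchronization, which motivates the CLOCK alternatives below.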

There are other LRU-based cache eviction strategies. For example, 2Q (Two-Queue LRU) maintains two queues and puts pages into the first queue during the initial access and moves them to the second hot queue on subsequent accesses, allowing you to distinguish between the recently and frequently accessed pages [JONSON94]. LRU-K identifies frequently referenced pages by keeping track of the last K accesses, and using this information to estimate access times on a page basis [ONEIL93].

CLOCK

In some situations, efficiency may be more important than precision. CLOCK algorithm variants are often used as compact, cache-friendly, and concurrent alternatives to LRU [SOUNDARARARJAN06]. Linux, for example, uses a variant of the CLOCK algorithm.

CLOCK-sweep holds references to pages and associated access bits in a circular buffer. Some variants use counters instead of bits to account for frequency. Every time the page is accessed, its access bit is set to 1. The algorithm works by going around the circular buffer, checking access bits:

  • If the access bit is 1, and the page is unreferenced, it is set to 0, and the next page is inspected.

  • If the access bit is already 0, the page becomes a candidate and is scheduled for eviction.

  • If the page is currently referenced, its access bit remains unchanged. It is assumed that the access bit of an accessed page cannot be 0, so it cannot be evicted. This makes referenced pages less likely to be replaced.
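The three rules above can be sketched as follows (a simplified, single-threaded illustration; a production implementation would use compare-and-swap on the hand and bits, and would reuse the victim’s slot for the incoming page):

```python
class ClockSweep:
    def __init__(self, page_ids):
        # One entry per cached page: [page_id, access_bit, referenced]
        self.entries = [[pid, 1, False] for pid in page_ids]
        self.hand = 0

    def access(self, page_id):
        for entry in self.entries:
            if entry[0] == page_id:
                entry[1] = 1                     # set the access bit on every access
                return

    def evict(self):
        # Assumes at least one unreferenced page exists; otherwise this loops.
        while True:
            entry = self.entries[self.hand]
            self.hand = (self.hand + 1) % len(self.entries)
            if entry[2]:
                continue                         # referenced: bit stays unchanged
            if entry[1] == 1:
                entry[1] = 0                     # second chance: clear the bit
            else:
                return entry[0]                  # bit already 0: eviction victim
```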

Figure 5-2 shows a circular buffer with access bits.

Figure 5-2. CLOCK-sweep example. Counters for currently referenced pages are shown in gray. Counters for unreferenced pages are shown in white. The arrow points to the element that will be inspected next.

An advantage of using a circular buffer is that both the clock hand pointer and contents can be modified using compare-and-swap operations, and do not require additional locking mechanisms. The algorithm is easy to understand and implement and is often used in both textbooks [TANENBAUM14] and real-world systems.

LRU is not always the best replacement strategy for a database system. Sometimes, it may be more practical to consider usage frequency rather than recency as a predictive factor. In the end, for a database system under a heavy load, recency might not be very indicative as it only represents the order in which items were accessed.

LFU

To improve the situation, we can start tracking page reference events rather than page-in events. One of the approaches allowing us to do this tracks least-frequently used (LFU) pages.

TinyLFU, a frequency-based page-eviction policy [EINZIGER17], does precisely this: instead of evicting pages based on page-in recency, it orders pages by usage frequency. It is implemented in the popular Java library called Caffeine.

TinyLFU uses a frequency histogram [CORMODE11] to maintain compact cache access history, since preserving an entire history might be prohibitively expensive for practical purposes.

Elements can be in one of the three queues:

  • Admission, maintaining newly added elements, implemented using LRU policy.

  • Probation, holding elements most likely to get evicted.

  • Protected, holding elements that are to stay in the queue for a longer time.

Rather than choosing which elements to evict every time, this approach chooses which ones to promote for retention. Only the items whose frequency is larger than that of the item that would be evicted as a result of promoting them can be moved to the probation queue. On subsequent accesses, items can get moved from probation to the protected queue. If the protected queue is full, one of the elements from it may have to be placed back into probation. More frequently accessed items have a higher chance of retention, and less frequently used ones are more likely to be evicted.
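The admission decision at the heart of this scheme can be sketched roughly as follows. This is a heavily simplified, hypothetical illustration: a plain counter stands in for the compact frequency histogram the real policy uses, and the queue machinery is omitted.

```python
from collections import Counter

class TinyLFUAdmission:
    """Simplified sketch of frequency-based admission (not the full TinyLFU)."""

    def __init__(self):
        self.freq = Counter()                    # stand-in for a compact histogram

    def record_access(self, page_id):
        self.freq[page_id] += 1

    def admit(self, candidate, victim):
        # The candidate is promoted only if it is accessed more frequently
        # than the probation-queue victim it would displace.
        return self.freq[candidate] > self.freq[victim]
```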

Figure 5-3 shows the logical connections between the admission, probation, and protected queues, the frequency filter, and eviction.

Figure 5-3. TinyLFU admission, protected, and probation queues

There are many other algorithms that can be used for optimal cache eviction. The choice of a page-replacement strategy has a significant impact on latency and the number of performed I/O operations, and has to be taken into consideration.

Recovery

Database systems are built on top of several hardware and software layers that can have their own stability and reliability problems. Database systems themselves, as well as the underlying software and hardware components, may fail. Database implementers have to consider these failure scenarios and make sure that the data that was “promised” to be written is, in fact, written.

A write-ahead log (WAL for short, also known as a commit log) is an append-only auxiliary disk-resident structure used for crash and transaction recovery. The page cache allows buffering changes to page contents in memory. Until the cached contents are flushed back to disk, the only disk-resident copy preserving the operation history is stored in the WAL. Many database systems use append-only write-ahead logs; for example, PostgreSQL and MySQL.

The main functionality of a write-ahead log can be summarized as:

  • Allow the page cache to buffer updates to disk-resident pages while ensuring durability semantics in the larger context of a database system.

  • Persist all operations on disk until the cached copies of pages affected by these operations are synchronized on disk. Every operation that modifies the database state has to be logged on disk before the contents of the associated pages can be modified.

  • 允许在发生崩溃时从操作日志重建丢失的内存中更改。

  • Allow lost in-memory changes to be reconstructed from the operation log in case of a crash.
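上述功能可以用一个极简的 Python 草图来说明:每个状态更改都先追加到日志中,然后才修改内存中的"页",这样崩溃后就可以通过重放日志来恢复。所有名称均为假设,仅作示意。

The functionality above can be illustrated with a minimal, purely hypothetical Python sketch: every state change is appended to the log before the in-memory "page" is modified, so a crash can be recovered by replaying the log. All names here are assumptions for illustration only.

```python
class WriteAheadLog:
    def __init__(self):
        self.records = []          # append-only; stands in for a disk-resident file

    def append(self, op):
        self.records.append(op)    # must be durable before the page changes


class Database:
    def __init__(self, wal):
        self.wal = wal
        self.pages = {}            # in-memory page cache contents

    def put(self, key, value):
        self.wal.append(("set", key, value))  # log the operation first...
        self.pages[key] = value               # ...then modify the page

    @classmethod
    def recover(cls, wal):
        # Rebuild lost in-memory changes by replaying the operation log.
        db = cls(WriteAheadLog())
        for op, key, value in wal.records:
            if op == "set":
                db.pages[key] = value
        return db


wal = WriteAheadLog()
db = Database(wal)
db.put("a", 1)
db.put("a", 2)
# Simulate a crash: the in-memory pages are gone, but the WAL survives.
recovered = Database.recover(wal)
assert recovered.pages == {"a": 2}
```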

除了此功能之外,预写日志在事务处理中也发挥着重要作用。WAL 的重要性怎么强调都不为过,因为它确保数据到达持久存储并在崩溃时可用,因为未提交的数据会从日志中重放,并且崩溃前的数据库状态会完全恢复。在本节中,我们经常会提到 ARIES(利用语义的恢复和隔离算法),这是一种被广泛使用和引用的最先进的恢复算法[MOHAN92]

In addition to this functionality, the write-ahead log plays an important role in transaction processing. It is hard to overstate the importance of the WAL as it ensures that data makes it to the persistent storage and is available in case of a crash, as uncommitted data is replayed from the log and the pre-crash database state is fully restored. In this section, we will often refer to ARIES (Algorithm for Recovery and Isolation Exploiting Semantics), a state-of-the-art recovery algorithm that is widely used and cited [MOHAN92].

日志语义

Log Semantics

预写日志是仅追加的,其写入的内容是不可变的,因此对日志的所有写入都是顺序的。由于 WAL 是不可变的、仅追加的数据结构,因此读取器可以安全地访问其内容,直到最新的写入阈值,同时写入器继续将数据追加到日志尾部。

The write-ahead log is append-only and its written contents are immutable, so all writes to the log are sequential. Since the WAL is an immutable, append-only data structure, readers can safely access its contents up to the latest write threshold while the writer continues appending data to the log tail.

WAL 由日志记录组成。每条日志记录都有唯一的、单调递增的日志序列号(LSN)。通常,LSN 由内部计数器或时间戳表示。由于日志记录不一定占满整个磁盘块,它们的内容会先缓存在日志缓冲区中,并在强制(force)操作中刷新到磁盘。强制操作在日志缓冲区填满时发生,也可以由事务管理器或页面缓存请求。所有日志记录都必须按 LSN 顺序刷新到磁盘。

The WAL consists of log records. Every record has a unique, monotonically increasing log sequence number (LSN). Usually, the LSN is represented by an internal counter or a timestamp. Since log records do not necessarily occupy an entire disk block, their contents are cached in the log buffer and are flushed on disk in a force operation. Forces happen as the log buffers fill up, and can be requested by the transaction manager or a page cache. All log records have to be flushed on disk in LSN order.

除了单独的操作记录外,WAL 还保存指示事务完成的记录。在日志强制达到其提交记录的 LSN 之前,不能将事务视为已提交。

Besides individual operation records, the WAL holds records indicating transaction completion. A transaction can’t be considered committed until the log is forced up to the LSN of its commit record.
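提交记录的语义可以这样示意:只有当日志被强制刷新到提交记录的 LSN 之后,事务才算已提交。以下草图中的结构和名称均为假设。

The commit-record semantics can be sketched as follows: a transaction counts as committed only once the log is forced (flushed) up to the LSN of its commit record. The structures and names in this sketch are hypothetical.

```python
class LogBuffer:
    def __init__(self):
        self.next_lsn = 1
        self.buffered = []     # records not yet on "disk"
        self.flushed = []      # records durably written
        self.flushed_lsn = 0

    def append(self, record):
        lsn = self.next_lsn
        self.next_lsn += 1
        self.buffered.append((lsn, record))
        return lsn

    def force(self, up_to_lsn):
        # Flush in LSN order, up to and including the requested record.
        while self.buffered and self.buffered[0][0] <= up_to_lsn:
            entry = self.buffered.pop(0)
            self.flushed.append(entry)
            self.flushed_lsn = entry[0]


def is_committed(log, commit_lsn):
    return log.flushed_lsn >= commit_lsn


log = LogBuffer()
log.append(("update", "x", 42))
commit_lsn = log.append(("commit", "T1"))

assert not is_committed(log, commit_lsn)   # commit record still buffered
log.force(commit_lsn)                      # force the log on commit
assert is_committed(log, commit_lsn)
```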

为了确保系统在回滚或恢复期间崩溃后能够继续正常运行,一些系统在撤消期间使用补偿日志记录(CLR)并将其存储在日志中。

To make sure the system can continue functioning correctly after a crash during rollback or recovery, some systems use compensation log records (CLR) during undo and store them in the log.

WAL 通常通过一个接口与主存储结构耦合,该接口允许在到达检查点时修剪日志。日志记录是数据库最关键的正确性环节之一,要正确实现有些棘手:日志修剪与确保数据已进入主存储结构之间哪怕最轻微的分歧,都可能导致数据丢失。

The WAL is usually coupled with a primary storage structure by the interface that allows trimming it whenever a checkpoint is reached. Logging is one of the most critical correctness aspects of the database, which is somewhat tricky to get right: even the slightest disagreements between log trimming and ensuring that the data has made it to the primary storage structure may cause data loss.

检查点是让日志知道到达特定标记之前的日志记录已完全持久化、不再需要的一种方式,这显著减少了数据库启动期间所需的工作量。强制将所有脏页刷新到磁盘的过程通常称为同步检查点,因为它完全同步了主存储结构。

Checkpoints are a way for a log to know that log records up to a certain mark are fully persisted and aren’t required anymore, which significantly reduces the amount of work required during the database startup. A process that forces all dirty pages to be flushed on disk is generally called a sync checkpoint, as it fully synchronizes the primary storage structure.

将全部内容刷新到磁盘相当不切实际,并且需要暂停所有正在运行的操作直到检查点完成,因此大多数数据库系统都实现模糊检查点。在这种情况下,存储在日志头中的 last_checkpoint 指针包含有关最后一个成功检查点的信息。模糊检查点以指示其开始的特殊 begin_checkpoint 日志记录开始,以 end_checkpoint 日志记录结束,后者包含有关脏页的信息以及事务表的内容。在该记录指定的所有页面都被刷新之前,检查点被认为是不完整的。页面是异步刷新的;一旦刷新完成,last_checkpoint 记录就会更新为 begin_checkpoint 记录的 LSN,并且在发生崩溃时,恢复过程将从那里开始 [MOHAN92]。

Flushing the entire contents on disk is rather impractical and would require pausing all running operations until the checkpoint is done, so most database systems implement fuzzy checkpoints. In this case, the last_checkpoint pointer stored in the log header contains the information about the last successful checkpoint. A fuzzy checkpoint begins with a special begin_checkpoint log record specifying its start, and ends with an end_checkpoint log record, containing information about the dirty pages and the contents of a transaction table. Until all the pages specified by this record are flushed, the checkpoint is considered to be incomplete. Pages are flushed asynchronously and, once this is done, the last_checkpoint record is updated with the LSN of the begin_checkpoint record; in case of a crash, the recovery process will start from there [MOHAN92].
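模糊检查点的流程可以用如下草图示意:begin_checkpoint 和 end_checkpoint 记录包围检查点,只有当 end_checkpoint 中列出的所有脏页都已刷新时,last_checkpoint 指针才会前移。所有名称均为假设。

The fuzzy checkpoint flow can be sketched as follows: begin_checkpoint and end_checkpoint records bracket the checkpoint, and the last_checkpoint pointer is only advanced once every dirty page captured in the end_checkpoint record has been flushed. All names are hypothetical.

```python
log = []
last_checkpoint = None  # pointer stored in the log header


def begin_fuzzy_checkpoint(dirty_pages, tx_table):
    begin_lsn = len(log)
    log.append(("begin_checkpoint",))
    # end_checkpoint captures the dirty pages and the transaction table.
    log.append(("end_checkpoint", set(dirty_pages), dict(tx_table)))
    return begin_lsn


def complete_checkpoint(begin_lsn, flushed_pages):
    global last_checkpoint
    # Incomplete until every page listed in end_checkpoint is flushed.
    _, listed_pages, _ = log[begin_lsn + 1]
    if listed_pages <= set(flushed_pages):
        last_checkpoint = begin_lsn   # recovery will start here after a crash
        return True
    return False


begin = begin_fuzzy_checkpoint({"p1", "p2"}, {"T1": "in-progress"})
assert not complete_checkpoint(begin, {"p1"})        # p2 not yet flushed
assert complete_checkpoint(begin, {"p1", "p2"})      # now fully persisted
assert last_checkpoint == begin
```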

操作与数据日志

Operation Versus Data Log

一些数据库系统,例如 System R [CHAMBERLIN81],使用影子分页:一种写入时复制技术,确保数据持久性和事务原子性。新内容被放置到新的未发布的影子页面中,并通过指针翻转(从旧页面到包含更新内容的页面)使其可见。

Some database systems, for example System R [CHAMBERLIN81], use shadow paging: a copy-on-write technique ensuring data durability and transaction atomicity. New contents are placed into the new unpublished shadow page and made visible with a pointer flip, from the old page to the one holding updated contents.

任何状态变化都可以由前像和后像,或由相应的重做和撤消操作来表示。对前像应用重做操作会产生后像;类似地,对后像应用撤消操作会产生前像。

Any state change can be represented by a before-image and an after-image or by corresponding redo and undo operations. Applying a redo operation to a before-image produces an after-image. Similarly, applying an undo operation to an after-image produces a before-image.

我们可以使用物理日志(存储完整的页面状态或对其的按字节更改)或逻辑日志(存储必须针对当前状态执行的操作),使记录或页面在时间上向前或向后从一种状态转换到另一种状态。跟踪物理和逻辑日志记录可应用于的页面的确切状态非常重要。

We can use a physical log (that stores complete page state or byte-wise changes to it) or a logical log (that stores operations that have to be performed against the current state) to move records or pages from one state to the other, both backward and forward in time. It is important to track the exact state of the pages that physical and logical log records can be applied to.

物理日志记录图像前后的情况,需要记录受操作影响的整个页面。逻辑日志指定必须对页面应用哪些操作,例如"insert a data record X for key Y",以及相应的撤消操作,例如"remove the value associated with Y"

Physical logging records before and after images, requiring entire pages affected by the operation to be logged. A logical log specifies which operations have to be applied to the page, such as "insert a data record X for key Y", and a corresponding undo operation, such as "remove the value associated with Y".
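为了使区别更具体,下面是一个假设性的 Python 草图,对比针对同一更改的物理记录(前像/后像)与逻辑记录(重做/撤消操作),其中逻辑操作沿用正文中的插入/删除示例:

To make the distinction concrete, here is a hypothetical Python sketch contrasting a physical record (before/after page images) and a logical record (redo/undo operations) for the same change, using the insert/remove example from the text:

```python
page = {"k1": "old"}

# Physical logging: whole before/after images of the affected page.
physical_record = {
    "before": {"k1": "old"},
    "after":  {"k1": "new"},
}

# Logical logging: the operation to redo and its corresponding undo.
logical_record = {
    "redo": ("insert", "k1", "new"),   # "insert a data record X for key Y"
    "undo": ("remove", "k1"),          # "remove the value associated with Y"
}


def apply_physical_redo(page, rec):
    return dict(rec["after"])          # install the after-image


def apply_physical_undo(page, rec):
    return dict(rec["before"])         # restore the before-image


def apply_logical(page, op):
    out = dict(page)
    if op[0] == "insert":
        out[op[1]] = op[2]
    elif op[0] == "remove":
        del out[op[1]]
    return out


after = apply_logical({}, logical_record["redo"])
assert after == {"k1": "new"}
assert apply_logical(after, logical_record["undo"]) == {}
assert apply_physical_redo(page, physical_record) == {"k1": "new"}
assert apply_physical_undo(after, physical_record) == {"k1": "old"}
```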

在实践中,许多数据库系统结合使用这两种方法,使用逻辑日志记录来执行撤消(为了并发性和性能),使用物理日志记录来执行重做(为了提高恢复时间)[MOHAN92 ]

In practice, many database systems use a combination of these two approaches, using logical logging to perform an undo (for concurrency and performance) and physical logging to perform a redo (to improve recovery time) [MOHAN92].

窃取与强制策略

Steal and Force Policies

为了确定何时必须将内存中所做的更改刷新到磁盘上,数据库管理系统定义窃取/不窃取和强制/不强制策略。这些策略主要适用于页面缓存,但最好在恢复上下文中讨论它们,因为它们对于可以与它们结合使用的恢复方法具有重大影响。

To determine when the changes made in memory have to be flushed on disk, database management systems define steal/no-steal and force/no-force policies. These policies are mostly applicable to the page cache, but they’re better discussed in the context of recovery, since they have a significant impact on which recovery approaches can be used in combination with them.

即使在事务提交之前也允许刷新事务修改的页面的恢复方法称为窃取策略。无窃取策略不允许刷新磁盘上任何未提交的事务内容。这里窃取脏页意味着将其内存内容刷新到磁盘并从磁盘加载不同的页面来代替它

A recovery method that allows flushing a page modified by the transaction even before the transaction has committed is called a steal policy. A no-steal policy does not allow flushing any uncommitted transaction contents on disk. To steal a dirty page here means flushing its in-memory contents to disk and loading a different page from disk in its place.

强制策略要求在事务提交之前事务修改的所有页面刷新到磁盘上。另一方面,无强制策略允许事务提交,即使在此事务期间修改的某些页面尚未刷新到磁盘上也是如此。这里强制使用页意味着在提交之前将其刷新到磁盘上。

A force policy requires all pages modified by the transactions to be flushed on disk before the transaction commits. On the other hand, a no-force policy allows a transaction to commit even if some pages modified during this transaction were not yet flushed on disk. To force a dirty page here means to flush it on disk before the commit.
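缓冲区管理器如何执行这两种策略可以用如下草图示意:无窃取策略禁止在事务提交之前驱逐其弄脏的页面,而强制策略在提交确认之前刷新事务修改的所有页面。所有名称均为假设。

How a buffer manager might enforce these policies can be sketched as follows: no-steal forbids evicting a page dirtied by an uncommitted transaction, while force flushes all of a transaction's pages before the commit is acknowledged. All names are hypothetical.

```python
class BufferManager:
    def __init__(self, steal, force):
        self.steal = steal
        self.force = force
        self.dirty = {}            # page -> owning (uncommitted) transaction
        self.flushed = set()

    def may_evict(self, page):
        # Under no-steal, a page dirtied by an in-flight transaction
        # cannot be flushed/evicted before that transaction commits.
        return self.steal or page not in self.dirty

    def commit(self, tx):
        pages = [p for p, owner in self.dirty.items() if owner == tx]
        if self.force:
            # Force policy: flush every page the transaction modified
            # before the commit is acknowledged.
            for p in pages:
                self.flushed.add(p)
        for p in pages:
            del self.dirty[p]


bm = BufferManager(steal=False, force=True)
bm.dirty["p1"] = "T1"
assert not bm.may_evict("p1")     # no-steal: uncommitted page stays cached
bm.commit("T1")
assert "p1" in bm.flushed         # force: flushed before commit returns
assert bm.may_evict("p1")
```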

理解窃取和强制策略很重要,因为它们对事务的撤消和重做有影响。撤消会回滚未提交事务对已刷新页面所做的更新,而重做则在磁盘上应用已提交事务执行的更改。

Steal and force policies are important to understand, since they have implications for transaction undo and redo. Undo rolls back updates that uncommitted transactions made to already-flushed pages, while redo applies changes performed by committed transactions on disk.

使用无窃取策略允许仅使用重做条目来实现恢复:旧副本包含在磁盘上的页面中,修改存储在日志中[WEIKUM01]。使用no-force,我们可以通过推迟页面的多次更新来缓冲它们。由于当时页面内容必须缓存在内存中,因此可能需要更大的页面缓存。

Using the no-steal policy allows implementing recovery using only redo entries: old copy is contained in the page on disk and modification is stored in the log [WEIKUM01]. With no-force, we potentially can buffer several updates to pages by deferring them. Since page contents have to be cached in memory for that time, a larger page cache may be needed.

当使用强制策略时,崩溃恢复不需要任何额外的工作来重建已提交事务的结果,因为这些事务修改的页面已经被刷新。使用此方法的一个主要缺点是,由于必要的 I/O,事务需要更长的时间才能提交。

When the force policy is used, crash recovery doesn’t need any additional work to reconstruct the results of committed transactions, since pages modified by these transactions are already flushed. A major drawback of using this approach is that transactions take longer to commit due to the necessary I/O.

更一般地说,在事务提交之前,我们需要有足够的信息来撤消其结果。如果事务触及的任何页面被刷新,我们需要在日志中保留撤消信息,直到它提交才能回滚。否则,我们必须在日志中保留重做记录,直到提交为止。在这两种情况下,在撤消或重做记录写入日志文件之前,事务无法提交。

More generally, until the transaction commits, we need to have enough information to undo its results. If any pages touched by the transaction are flushed, we need to keep undo information in the log until it commits to be able to roll it back. Otherwise, we have to keep redo records in the log until it commits. In both cases, the transaction cannot commit until either its undo or its redo records are written to the logfile.

ARIES

ARIES

ARIES 是一种窃取/无强制恢复算法。它使用物理重做来提高恢复期间的性能(因为可以更快地安装更改),并使用逻辑撤消来提高正常操作期间的并发性(因为逻辑撤消操作可以独立地应用于各个页面)。它使用 WAL 记录在恢复期间重复历史,在撤消未提交的事务之前完全重建数据库状态,并在撤消期间创建补偿日志记录 [MOHAN92]。

ARIES is a steal/no-force recovery algorithm. It uses physical redo to improve performance during recovery (since changes can be installed quicker) and logical undo to improve concurrency during normal operation (since logical undo operations can be applied to pages independently). It uses WAL records to implement repeating history during recovery, to completely reconstruct the database state before undoing uncommitted transactions, and creates compensation log records during undo [MOHAN92].

当数据库系统崩溃后重新启动时,恢复分三个阶段进行:

When the database system restarts after the crash, recovery proceeds in three phases:

  1. 分析阶段识别页面缓存中的脏页以及崩溃时正在进行的事务。有关脏页的信息用于确定重做阶段的起点。正在进行的事务列表则在撤消阶段用于回滚未完成的事务。

  1. The analysis phase identifies dirty pages in the page cache and transactions that were in progress at the time of a crash. Information about dirty pages is used to identify the starting point for the redo phase. A list of in-progress transactions is used during the undo phase to roll back incomplete transactions.

  2. 重做阶段重复历史直至崩溃点,并将数据库恢复到崩溃前的状态。此阶段既针对未完成的事务,也针对已提交但其内容尚未刷新到持久存储的事务。

  2. The redo phase repeats the history up to the point of a crash and restores the database to the previous state. This phase is done for incomplete transactions as well as ones that were committed but whose contents weren’t flushed to persistent storage.

  3. 撤消阶段回滚所有未完成的事务,并将数据库恢复到最后的一致状态。所有操作均按时间倒序回滚。如果数据库在恢复过程中再次崩溃,撤消事务的操作也会被记录,以避免重复执行。

  3. The undo phase rolls back all incomplete transactions and restores the database to the last consistent state. All operations are rolled back in reverse chronological order. In case the database crashes again during recovery, operations that undo transactions are logged as well to avoid repeating them.

ARIES 使用 LSN 来识别日志记录,跟踪脏页表中运行事务修改的页面,并使用物理重做、逻辑撤消和模糊检查点。尽管描述该系统的论文于 1992 年发布,但大多数概念、方法和范例在今天的事务处理和恢复中仍然相关。

ARIES uses LSNs for identifying log records, tracks pages modified by running transactions in the dirty page table, and uses physical redo, logical undo, and fuzzy checkpointing. Even though the paper describing this system was released in 1992, most concepts, approaches, and paradigms are still relevant in transaction processing and recovery today.
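这三个阶段可以用一个高度简化的草图串起来。真正的 ARIES 会按页面跟踪 LSN 并写入补偿日志记录;这里仅展示分析/重做/撤消的流程,所有结构均为假设。

The three phases can be tied together in a highly simplified sketch. Real ARIES tracks LSNs per page and writes compensation log records; this only shows the analysis/redo/undo flow, with entirely hypothetical structures.

```python
# Each entry: (lsn, transaction, op, key, (before_image, after_image))
log = [
    (1, "T1", "update", "x", ("old", 1)),
    (2, "T2", "update", "y", ("old", 2)),
    (3, "T1", "commit", None, None),
    # crash happens here: T2 never committed
]


def recover(log):
    # 1. Analysis: find transactions still in progress at crash time.
    committed = {tx for _, tx, op, _, _ in log if op == "commit"}
    losers = {tx for _, tx, op, _, _ in log if op == "update"} - committed

    # 2. Redo: repeat history up to the crash point (all updates).
    db = {}
    for _, tx, op, key, images in log:
        if op == "update":
            db[key] = images[1]           # install the after-image

    # 3. Undo: roll back loser transactions in reverse LSN order.
    for _, tx, op, key, images in reversed(log):
        if op == "update" and tx in losers:
            db[key] = images[0]           # restore the before-image
    return db


assert recover(log) == {"x": 1, "y": "old"}
```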

并发控制

Concurrency Control

在《DBMS 体系结构》中讨论数据库管理系统体系结构时,我们提到事务管理器和锁管理器协同工作来处理并发控制。并发控制是一组用于处理并发执行事务之间交互的技术。这些技术可以大致分为以下几类:

When discussing database management system architecture in “DBMS Architecture”, we mentioned that the transaction manager and lock manager work together to handle concurrency control. Concurrency control is a set of techniques for handling interactions between concurrently executing transactions. These techniques can be roughly grouped into the following categories:

乐观并发控制 (OCC)
Optimistic concurrency control (OCC)

允许事务执行并发的读写操作,并确定组合执行的结果是否可串行化。换句话说,事务不会相互阻塞,维护其操作的历史记录,并在提交之前检查这些历史记录是否存在可能的冲突。如果执行导致冲突,则冲突的事务之一将被中止。

Allows transactions to execute concurrent read and write operations, and determines whether or not the result of the combined execution is serializable. In other words, transactions do not block each other, maintain histories of their operations, and check these histories for possible conflicts before commit. If execution results in a conflict, one of the conflicting transactions is aborted.

多版本并发控制 (MVCC)
Multiversion concurrency control (MVCC)

通过允许存在记录的多个带时间戳的版本,保证在由时间戳标识的过去某个时刻上数据库的一致视图。MVCC 可以使用验证技术来实现(只允许更新或提交事务之一获胜),也可以使用无锁技术(例如时间戳排序)或基于锁的技术(例如两阶段锁定)来实现。

Guarantees a consistent view of the database at some point in the past identified by the timestamp by allowing multiple timestamped versions of the record to be present. MVCC can be implemented using validation techniques, allowing only one of the updating or committing transactions to win, as well as with lockless techniques such as timestamp ordering, or lock-based ones, such as two-phase locking.

悲观(也称为保守)并发控制 (PCC)
Pessimistic (also known as conservative) concurrency control (PCC)

既有基于锁的保守方法,也有非锁的保守方法,它们的不同之处在于管理和授予对共享资源访问权限的方式。基于锁的方法要求事务在数据库记录上持有锁,以防止其他事务修改被锁定的记录或访问正在被修改的记录,直到该事务释放其锁。非锁定方法维护读写操作列表,并根据未完成事务的调度限制执行。当多个事务相互等待对方释放锁才能继续时,悲观调度可能导致死锁。

There are both lock-based and nonlocking conservative methods, which differ in how they manage and grant access to shared resources. Lock-based approaches require transactions to maintain locks on database records to prevent other transactions from modifying locked records and accessing records that are being modified until the transaction releases its locks. Nonlocking approaches maintain read and write operation lists and restrict execution, depending on the schedule of unfinished transactions. Pessimistic schedules can result in a deadlock when multiple transactions wait for each other to release a lock in order to proceed.

在本章中,我们集中讨论节点本地并发控制技术。在第 13 章中,您可以找到有关分布式事务和其他方法的信息,例如确定性并发控制(请参阅“Calvin 的分布式事务”)。

In this chapter, we concentrate on node-local concurrency control techniques. In Chapter 13, you can find information about distributed transactions and other approaches, such as deterministic concurrency control (see “Distributed Transactions with Calvin”).

在进一步讨论并发控制之前,我们需要定义一组我们试图解决的问题,并讨论事务操作如何重叠以及这种重叠会产生什么后果。

Before we can further discuss concurrency control, we need to define a set of problems we’re trying to solve and discuss how transaction operations overlap and what consequences this overlapping has.

可串行化

Serializability

事务由针对数据库状态执行的读写操作以及业务逻辑(应用于所读内容的转换)组成。调度是从数据库系统角度执行一组事务所需的操作列表(即仅包括与数据库状态交互的操作,例如读、写、提交或中止操作),因为所有其他操作都被假定为无副作用(换句话说,对数据库状态没有影响)[MOLINA08]。

Transactions consist of read and write operations executed against the database state, and business logic (transformations, applied to the read contents). A schedule is a list of operations required to execute a set of transactions from the database-system perspective (i.e., only ones that interact with the database state, such as read, write, commit, or abort operations), since all other operations are assumed to be side-effect free (in other words, have no impact on the database state) [MOLINA08].

如果调度包含其中执行的每个事务的所有操作,则该调度是完整的。正确的调度在逻辑上等同于原始操作列表,但其各部分可以并行执行或出于优化目的重新排序,只要这不违反 ACID 属性和单个事务结果的正确性 [WEIKUM01]。

A schedule is complete if it contains all operations from every transaction executed in it. Correct schedules are logical equivalents to the original lists of operations, but their parts can be executed in parallel or get reordered for optimization purposes, as long as this does not violate ACID properties and the correctness of the results of individual transactions [WEIKUM01].

当调度中的事务完全独立、没有任何交错地执行时,该调度被称为串行的:每个前面的事务在下一个事务开始之前完全执行。与多个多步骤事务之间所有可能的交错相比,串行执行很容易推理。然而,总是一个接一个地执行事务会显著限制系统吞吐量并损害性能。

A schedule is said to be serial when transactions in it are executed completely independently and without any interleaving: every preceding transaction is fully executed before the next one starts. Serial execution is easy to reason about, as contrasted with all possible interleavings between several multistep transactions. However, always executing transactions one after another would significantly limit the system throughput and hurt performance.

我们需要找到一种方法来并发执行事务操作,同时保持串行调度的正确性和简单性。我们可以通过可串行化的调度来实现这一点。如果一个调度等价于同一组事务上的某个完整串行调度,则该调度是可串行化的。换句话说,它产生的结果与我们按某种顺序依次执行这组事务相同。图 5-4 显示了三个并发事务,以及可能的执行历史(3! = 6 种可能性,按每种可能的顺序)。

We need to find a way to execute transaction operations concurrently, while maintaining the correctness and simplicity of a serial schedule. We can achieve this with serializable schedules. A schedule is serializable if it is equivalent to some complete serial schedule over the same set of transactions. In other words, it produces the same result as if we executed a set of transactions one after another in some order. Figure 5-4 shows three concurrent transactions, and possible execution histories (3! = 6 possibilities, in every possible order).
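上述定义可以用暴力方法来示意:如果一个调度的结果与同一组事务的某个串行顺序的结果一致,它就是可串行化的。下面的草图中,每个事务是对共享字典的一组步骤,所有内容均为假设,仅作说明。

The definition above can be illustrated by brute force: a schedule is serializable if its outcome matches the outcome of some serial order of the same transactions. In the sketch below, each transaction is a list of steps over a shared dict; everything is hypothetical and for illustration only.

```python
from itertools import permutations


def run_serial(transactions, order):
    # Execute the given transactions one after another, in the given order.
    state = {"v": 0}
    for i in order:
        for step in transactions[i]:
            step(state)
    return state


# Two tiny transactions: T1 adds 1 to v, T2 doubles v.
t1 = [lambda s: s.__setitem__("v", s["v"] + 1)]
t2 = [lambda s: s.__setitem__("v", s["v"] * 2)]
transactions = [t1, t2]

# Outcomes of every possible serial order (2! = 2 here, 3! = 6 for three).
serial_outcomes = {
    run_serial(transactions, order)["v"]
    for order in permutations(range(len(transactions)))
}


def is_serializable(outcome):
    return outcome in serial_outcomes


assert is_serializable(1)       # T2 then T1: 0 * 2 + 1
assert is_serializable(2)       # T1 then T2: (0 + 1) * 2
assert not is_serializable(3)   # no serial order produces 3
```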

图 5-4。并发事务及其可能的串行执行历史
Figure 5-4. Concurrent transactions and their possible serial execution histories

事务隔离

Transaction Isolation

事务性数据库系统允许不同的隔离级别。隔离级别指定事务的某些部分如何以及何时可以并且应该对其他事务可见。换句话说,隔离级别描述了事务与其他并发执行的事务之间的隔离程度,以及执行过程中可能遇到哪些类型的异常。

Transactional database systems allow different isolation levels. An isolation level specifies how and when parts of the transaction can and should become visible to other transactions. In other words, isolation levels describe the degree to which transactions are isolated from other concurrently executing transactions, and what kinds of anomalies can be encountered during execution.

实现隔离是有代价的:为了防止不完整或临时写入跨越事务边界传播,我们需要额外的协调和同步,这会对性能产生负面影响。

Achieving isolation comes at a cost: to prevent incomplete or temporary writes from propagating over transaction boundaries, we need additional coordination and synchronization, which negatively impacts the performance.

读写异常

Read and Write Anomalies

SQL 标准[MELTON06]引用并描述了并发事务执行期间可能发生的读取异常:脏读、不可重复读和幻读。

The SQL standard [MELTON06] refers to and describes read anomalies that can occur during execution of concurrent transactions: dirty, nonrepeatable, and phantom reads.

脏读是指一个事务可以读取其他事务未提交更改的情况。例如,事务 T1 用地址字段的新值更新用户记录,事务 T2 在 T1 提交之前读取更新后的地址。随后事务 T1 中止并回滚其执行结果。然而,T2 已经读到了这个值,也就是说它访问了一个从未被提交的值。

A dirty read is a situation in which a transaction can read uncommitted changes from other transactions. For example, transaction T1 updates a user record with a new value for the address field, and transaction T2 reads the updated address before T1 commits. Transaction T1 aborts and rolls back its execution results. However, T2 has already been able to read this value, so it has accessed the value that has never been committed.

不可重复读取(有时称为模糊读取)是指一个事务两次查询同一行并得到不同结果的情况。例如,即使事务 T1 读取了一行,然后事务 T2 修改它并提交此更改,也可能发生这种情况。如果 T1 在执行完成之前再次请求同一行,结果将与上次不同。

A nonrepeatable read (sometimes called a fuzzy read) is a situation in which a transaction queries the same row twice and gets different results. For example, this can happen even if transaction T1 reads a row, then transaction T2 modifies it and commits this change. If T1 requests the same row again before finishing its execution, the result will differ from the previous run.

如果我们在事务期间使用范围读取(即读取的不是单个数据记录,而是一系列记录),我们可能会看到幻影记录。幻读是指一个事务两次查询同一行集并收到不同的结果。它类似于不可重复读取,但适用于范围查询。

If we use range reads during the transaction (i.e., read not a single data record, but a range of records), we might see phantom records. A phantom read is when a transaction queries the same set of rows twice and receives different results. It is similar to a nonrepeatable read, but holds for range queries.

还有一些具有类似语义的写异常:丢失更新、脏写和写倾斜。

There are also write anomalies with similar semantics: lost update, dirty write, and write skew.

丢失更新发生在事务 T1 和 T2 都尝试更新 V 的值时。T1 和 T2 都读取 V 的值。T1 更新 V 并提交,随后 T2 更新 V 并提交。由于这两个事务不知道彼此的存在,如果都允许提交,T1 的结果将被 T2 的结果覆盖,来自 T1 的更新将丢失。

A lost update occurs when transactions T1 and T2 both attempt to update the value of V. T1 and T2 read the value of V. T1 updates V and commits, and T2 updates V after that and commits as well. Since the transactions are not aware of each other’s existence, if both of them are allowed to commit, the results of T1 will be overwritten by the results of T2, and the update from T1 will be lost.

脏写是指其中一个事务获取未提交的值(即脏读)、修改它并保存它的情况。换句话说,就是事务结果基于从未提交的值的情况。

A dirty write is a situation in which one of the transactions takes an uncommitted value (i.e., dirty read), modifies it, and saves it. In other words, when transaction results are based on the values that have never been committed.

写倾斜发生在每个单独的事务都遵守所需的不变量,但它们的组合不满足这些不变量的情况下。例如,事务 T1 和 T2 修改两个账户 A1 和 A2 的值。A1 初始为 100$,A2 初始为 150$。只要两个账户的总和非负(A1 + A2 >= 0),账户的值就允许为负。T1 和 T2 分别尝试从 A1 和 A2 取出 200$。由于这些事务开始时 A1 + A2 = 250$,总共有 250$ 可用。两个事务都假设自己保持了不变量,于是都被允许提交。提交后,A1 为 -100$,A2 为 -50$,这显然违反了保持账户总额为正的要求 [FEKETE04]。

A write skew occurs when each individual transaction respects the required invariants, but their combination does not satisfy these invariants. For example, transactions T1 and T2 modify values of two accounts A1 and A2. A1 starts with 100$ and A2 starts with 150$. The account value is allowed to be negative, as long as the sum of the two accounts is nonnegative: A1 + A2 >= 0. T1 and T2 each attempt to withdraw 200$ from A1 and A2, respectively. Since at the time these transactions start A1 + A2 = 250$, 250$ is available in total. Both transactions assume they’re preserving the invariant and are allowed to commit. After the commit, A1 has -100$ and A2 has -50$, which clearly violates the requirement to keep a sum of the accounts positive [FEKETE04].
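上面的账户示例可以逐步算出来:每个事务都针对自己的快照检查不变量 A1 + A2 >= 0,因此两者都认为取出 200$ 是安全的,而合并后的结果违反了不变量。以下代码纯属示意。

The account example above can be worked through step by step: each transaction checks the invariant A1 + A2 >= 0 against its own snapshot, so both believe a 200$ withdrawal is safe, and the combined result violates the invariant. The code below is purely illustrative.

```python
accounts = {"A1": 100, "A2": 150}


def withdraw(snapshot, account, amount):
    # Each transaction validates the invariant against its private snapshot.
    if snapshot["A1"] + snapshot["A2"] - amount >= 0:
        return ("write", account, snapshot[account] - amount)
    return None  # invariant would be violated; abort


snap_t1 = dict(accounts)    # both transactions start from the same state
snap_t2 = dict(accounts)

w1 = withdraw(snap_t1, "A1", 200)   # 250 - 200 >= 0, so T1 proceeds
w2 = withdraw(snap_t2, "A2", 200)   # 250 - 200 >= 0, so T2 proceeds
assert w1 is not None and w2 is not None

# Both commit: they touched different records, so neither sees a conflict.
for _, account, value in (w1, w2):
    accounts[account] = value

assert accounts == {"A1": -100, "A2": -50}
assert accounts["A1"] + accounts["A2"] < 0   # invariant violated: write skew
```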

隔离级别

Isolation Levels

最低(换句话说,最弱)隔离级别是未提交读取。在此隔离级别下,事务系统允许一个事务观察其他并发事务未提交的更改。换句话说,脏读是允许的。

The lowest (in other words, weakest) isolation level is read uncommitted. Under this isolation level, the transactional system allows one transaction to observe uncommitted changes of other concurrent transactions. In other words, dirty reads are allowed.

我们可以避免其中一些异常。例如,我们可以确保特定事务执行的任何读取只能读到已提交的更改。但是,这并不能保证该事务在稍后再次尝试读取同一数据记录时会看到相同的值。如果两次读取之间存在已提交的修改,同一事务中的两个查询将产生不同的结果。换句话说,不允许脏读,但允许幻读和不可重复读。这个隔离级别称为读已提交。如果我们进一步禁止不可重复读取,就得到可重复读隔离级别。

We can avoid some of the anomalies. For example, we can make sure that any read performed by the specific transaction can only read already committed changes. However, it is not guaranteed that if the transaction attempts to read the same data record once again at a later stage, it will see the same value. If there was a committed modification between two reads, two queries in the same transaction would yield different results. In other words, dirty reads are not permitted, but phantom and nonrepeatable reads are. This isolation level is called read committed. If we further disallow nonrepeatable reads, we get a repeatable read isolation level.

最强的隔离级别是可串行化。正如我们在“可串行性”中已经讨论过的,它保证事务结果将以某种顺序出现,就好像事务是串行执行的(即,时间上没有重叠)。禁止并发执行会对数据库性能产生重大负面影响。只要事务的内部不变量保持并且可以并发执行,事务就可以重新排序,但它们的结果必须以某种串行顺序出现。

The strongest isolation level is serializability. As we already discussed in “Serializability”, it guarantees that transaction outcomes will appear in some order as if transactions were executed serially (i.e., without overlapping in time). Disallowing concurrent execution would have a substantial negative impact on the database performance. Transactions can get reordered, as long as their internal invariants hold and can be executed concurrently, but their outcomes have to appear in some serial order.

图 5-5显示了隔离级别及其允许的异常情况。

Figure 5-5 shows isolation levels and the anomalies they allow.

图 5-5。隔离级别及其允许的异常
Figure 5-5. Isolation levels and the anomalies they allow

没有依赖关系的事务可以按任何顺序执行,因为它们的结果完全独立。与线性化(我们在分布式系统的上下文中讨论;请参阅“线性化”)不同,可串行化是以任意顺序执行的多个操作的属性。它并不暗示或试图对事务的执行强加任何特定的顺序。ACID 术语中的隔离性意味着可串行化 [BAILIS14a]。不幸的是,实现可串行化需要协调。换句话说,并发执行的事务必须协调以保持不变量,并对冲突的执行强加串行顺序 [BAILIS14b]。

Transactions that do not have dependencies can be executed in any order since their results are fully independent. Unlike linearizability (which we discuss in the context of distributed systems; see “Linearizability”), serializability is a property of multiple operations executed in arbitrary order. It does not imply or attempt to impose any particular order on executing transactions. Isolation in ACID terms means serializability [BAILIS14a]. Unfortunately, implementing serializability requires coordination. In other words, transactions executing concurrently have to coordinate to preserve invariants and impose a serial order on conflicting executions [BAILIS14b].

一些数据库使用快照隔离。在快照隔离下,事务可以观察在其启动时提交的所有事务所执行的状态更改。每个事务都会获取数据快照并对其执行查询。该快照在事务执行期间不能更改。仅当事务所修改的值在执行时未发生更改时,事务才会提交。否则,它将中止并回滚。

Some databases use snapshot isolation. Under snapshot isolation, a transaction can observe the state changes performed by all transactions that were committed by the time it has started. Each transaction takes a snapshot of data and executes queries against it. This snapshot cannot change during transaction execution. The transaction commits only if the values it has modified did not change while it was executing. Otherwise, it is aborted and rolled back.

如果两个事务尝试修改相同的值,则只允许其中一个提交。这防止了丢失更新异常。例如,事务 T1 和 T2 都尝试修改 V。它们从快照中读取 V 的当前值,该快照包含它们启动之前已提交的所有事务的更改。无论哪个事务先尝试提交,它都会提交,而另一个则必须中止。失败的事务将重试,而不是覆盖该值。

If two transactions attempt to modify the same value, only one of them is allowed to commit. This precludes a lost update anomaly. For example, transactions T1 and T2 both attempt to modify V. They read the current value of V from the snapshot that contains changes from all transactions that were committed before they started. Whichever transaction attempts to commit first will commit, and the other one will have to abort. The failed transaction will retry instead of overwriting the value.
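上述冲突检查可以这样草拟:在快照隔离下,只有当事务写集中的值在其快照建立之后没有被其他已提交事务更改时,事务才允许提交("先提交者获胜")。所有名称均为假设。

The conflict check above can be sketched as follows: under snapshot isolation, a transaction may commit only if no value in its write set was changed by a transaction that committed after its snapshot was taken ("first committer wins"). All names are hypothetical.

```python
class Store:
    def __init__(self):
        self.commit_ts = 0
        self.data = {}            # key -> (value, commit timestamp)

    def begin(self):
        return {"snapshot_ts": self.commit_ts, "writes": {}}

    def write(self, tx, key, value):
        tx["writes"][key] = value

    def commit(self, tx):
        # First committer wins: abort if any written key changed since
        # the transaction's snapshot was taken.
        for key in tx["writes"]:
            if key in self.data and self.data[key][1] > tx["snapshot_ts"]:
                return False      # abort; the caller may retry
        self.commit_ts += 1
        for key, value in tx["writes"].items():
            self.data[key] = (value, self.commit_ts)
        return True


store = Store()
t1, t2 = store.begin(), store.begin()   # both observe the same snapshot
store.write(t1, "v", 10)
store.write(t2, "v", 20)
assert store.commit(t1)       # first committer wins
assert not store.commit(t2)   # conflicting update is aborted, not lost
assert store.data["v"][0] == 10
```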

写入倾斜异常在快照隔离下是可能的,因为如果两个事务从本地状态读取、修改独立记录并保留本地不变量,则它们都被允许提交[FEKETE04]我们在“使用 Percolator 进行分布式事务”中的分布式事务上下文中更详细地讨论了快照隔离。

A write skew anomaly is possible under snapshot isolation, since if two transactions read from local state, modify independent records, and preserve local invariants, they both are allowed to commit [FEKETE04]. We discuss snapshot isolation in more detail in the context of distributed transactions in “Distributed Transactions with Percolator”.

乐观并发控制

Optimistic Concurrency Control

乐观并发控制假设事务冲突很少发生;与其使用锁并阻塞事务执行,我们可以对事务进行验证,以防止与并发执行的事务发生读/写冲突,并在提交其结果之前确保可串行性。一般来说,事务执行分为三个阶段 [WEIKUM01]:

Optimistic concurrency control assumes that transaction conflicts occur rarely and, instead of using locks and blocking transaction execution, we can validate transactions to prevent read/write conflicts with concurrently executing transactions and ensure serializability before committing their results. Generally, transaction execution is split into three phases [WEIKUM01]:

读取阶段
Read phase

事务在其自己的私有上下文中执行其步骤,而不使任何更改对其他事务可见。在此步骤之后,所有事务依赖性(读集)以及事务产生的副作用(写集)都已知。

The transaction executes its steps in its own private context, without making any of the changes visible to other transactions. After this step, all transaction dependencies (read set) are known, as well as the side effects the transaction produces (write set).

验证阶段
Validation phase

检查并发事务的写入集,以确定其操作之间是否存在可能违反可串行性的冲突。如果事务正在读取的某些数据现在已过时,或者它将覆盖在其读取阶段提交的事务写入的某些值,则其私有上下文将被清除并重新启动读取阶段。换句话说,验证阶段确定提交事务是否保留 ACID 属性。

Read and write sets of concurrent transactions are checked for the presence of possible conflicts between their operations that might violate serializability. If some of the data the transaction was reading is now out-of-date, or it would overwrite some of the values written by transactions that committed during its read phase, its private context is cleared and the read phase is restarted. In other words, the validation phase determines whether or not committing the transaction preserves ACID properties.

写入阶段
Write phase

如果验证阶段尚未确定任何冲突,事务可以将其写入集从私有上下文提交到数据库状态。

If the validation phase hasn’t determined any conflicts, the transaction can commit its write set from the private context to the database state.

验证可以通过检查与已提交的事务(向后导向)或当前处于验证阶段的事务(向前导向)的冲突来完成。不同事务的验证和写入阶段应该以原子方式完成。当其他事务正在验证时,不允许提交任何事务。由于验证和写入阶段通常比读取阶段短,因此这是一个可以接受的折衷方案。

Validation can be done by checking for conflicts with the transactions that have already been committed (backward-oriented), or with the transactions that are currently in the validation phase (forward-oriented). Validation and write phases of different transactions should be done atomically. No transaction is allowed to commit while some other transaction is being validated. Since validation and write phases are generally shorter than the read phase, this is an acceptable compromise.

向后导向的并发控制确保对于任何一对事务 T1 和 T2,以下属性成立:

Backward-oriented concurrency control ensures that for any pair of transactions T1 and T2, the following properties hold:

  • T1 在 T2 的读取阶段开始之前提交,因此 T2 被允许提交。

  • T1 was committed before the read phase of T2 began, so T2 is allowed to commit.

  • T1 在 T2 的写入阶段之前提交,并且 T1 的写入集与 T2 的读取集不相交。换句话说,T1 没有写入任何 T2 本应看到的值。

  • T1 was committed before the T2 write phase, and the write set of T1 doesn’t intersect with the T2 read set. In other words, T1 hasn’t written any values T2 should have seen.

  • T1 的读取阶段在 T2 的读取阶段之前完成,并且 T2 的写入集与 T1 的读取集或写入集都不相交。换句话说,两个事务操作的是互相独立的数据记录集,因此都允许提交。

  • The read phase of T1 has completed before the read phase of T2, and the write set of T2 doesn’t intersect with the read or write sets of T1. In other words, transactions have operated on independent sets of data records, so both are allowed to commit.

如果验证通常都能成功、事务不必重试,则此方法非常高效,因为重试会对性能产生显著的负面影响。当然,乐观并发仍然有一个临界区,事务一次只能进入一个。另一种允许某些操作非独占所有权的方法是使用读写锁(允许读者共享访问)和可升级锁(允许在需要时将共享锁转换为独占锁)。

This approach is efficient if validation usually succeeds and transactions don’t have to be retried, since retries have a significant negative impact on performance. Of course, optimistic concurrency still has a critical section, which transactions can enter one at a time. Another approach that allows nonexclusive ownership for some operations is to use readers-writer locks (to allow shared access for readers) and upgradeable locks (to allow conversion of shared locks to exclusive when needed).
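向后导向验证的核心检查可以这样草拟:提交中的事务与那些在其读取阶段期间已提交的事务比较,如果某个已提交事务的写集与其读集相交,则必须重新开始。所有名称和结构均为假设。

The core check of backward-oriented validation can be sketched as follows: a committing transaction is compared against transactions that committed during its read phase, and if a committed write set intersects its read set, it must restart. All names and structures are hypothetical.

```python
def validate(tx, committed):
    # tx carries its read set and the timestamp at which its read phase
    # started; entries of committed carry write sets and commit timestamps.
    for other in committed:
        if other["commit_ts"] <= tx["read_start"]:
            continue     # committed before tx began reading: no conflict
        if other["write_set"] & tx["read_set"]:
            return False  # tx may have read a value that other overwrote
    return True


committed = [
    {"write_set": {"x"}, "commit_ts": 5},
    {"write_set": {"y"}, "commit_ts": 15},
]

ok_tx = {"read_set": {"x", "z"}, "read_start": 10}
bad_tx = {"read_set": {"y"}, "read_start": 10}

assert validate(ok_tx, committed)       # x was committed before tx started
assert not validate(bad_tx, committed)  # y changed during tx's read phase
```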

多版本并发控制

Multiversion Concurrency Control

多版本并发控制是通过允许多个记录版本并使用单调递增的事务 ID 或时间戳来实现数据库管理系统中事务一致性的一种方法。这允许读取和写入在存储级别上以最小的协调进行,因为读取可以继续访问旧值,直到提交新值。

Multiversion concurrency control is a way to achieve transactional consistency in database management systems by allowing multiple record versions and using monotonically incremented transaction IDs or timestamps. This allows reads and writes to proceed with a minimal coordination on the storage level, since reads can continue accessing older values until the new ones are committed.

MVCC区分已提交未提交版本,分别对应已提交和未提交事务的值版本。该值的最后提交版本被假定为current。一般来说,在这种情况下,事务管理器的目标是一次最多拥有一个未提交的值。

MVCC distinguishes between committed and uncommitted versions, which correspond to value versions of committed and uncommitted transactions. The last committed version of the value is assumed to be current. Generally, the goal of the transaction manager in this case is to have at most one uncommitted value at a time.

根据数据库系统实现的隔离级别,读操作可能被允许也可能不被允许访问未提交的值 [WEIKUM01]。多版本并发可以使用锁定、调度和冲突解决技术(例如两阶段锁定)或时间戳排序来实现。实现快照隔离是 MVCC 的主要用例之一 [HELLERSTEIN07]。

Depending on the isolation level implemented by the database system, read operations may or may not be allowed to access uncommitted values [WEIKUM01]. Multiversion concurrency can be implemented using locking, scheduling, and conflict resolution techniques (such as two-phase locking), or timestamp ordering. One of the major use cases for MVCC is implementing snapshot isolation [HELLERSTEIN07].
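多版本存储可以这样草拟:每个键保留一个按时间戳排列的已提交版本列表,读者看到提交时间戳不晚于其快照时间戳的最新版本,因此读取无需阻塞写入。所有名称均为假设。

Multiversion storage can be sketched as follows: each key keeps a list of timestamped committed versions, and a reader sees the latest version with a commit timestamp at or below its snapshot timestamp, so reads need not block writes. All names are hypothetical.

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (commit_ts, value), ascending

    def commit_write(self, key, value, commit_ts):
        self.versions.setdefault(key, []).append((commit_ts, value))

    def read(self, key, snapshot_ts):
        # Readers never block writers: they simply pick an older version.
        result = None
        for commit_ts, value in self.versions.get(key, []):
            if commit_ts <= snapshot_ts:
                result = value
        return result


mv = MVCCStore()
mv.commit_write("k", "v1", commit_ts=1)
mv.commit_write("k", "v2", commit_ts=3)

assert mv.read("k", snapshot_ts=2) == "v1"   # older snapshot sees v1
assert mv.read("k", snapshot_ts=3) == "v2"   # newer snapshot sees v2
assert mv.read("k", snapshot_ts=0) is None   # before any commit
```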

悲观并发控制

Pessimistic Concurrency Control

悲观并发控制计划比乐观计划更为保守。这些方案在运行时确定事务冲突并阻止或中止其执行。

Pessimistic concurrency control schemes are more conservative than optimistic ones. These schemes determine transaction conflicts while they’re running and block or abort their execution.

最简单的悲观(无锁)并发控制方案之一是时间戳排序,其中每个事务都有一个时间戳。事务操作是否被允许执行,取决于是否已经有时间戳更早的事务提交。为了实现这一点,事务管理器必须为每个值维护 max_read_timestamp 和 max_write_timestamp,用于描述并发事务执行的读取和写入操作。

One of the simplest pessimistic (lock-free) concurrency control schemes is timestamp ordering, where each transaction has a timestamp. Whether or not transaction operations are allowed to be executed is determined by whether or not any transaction with an earlier timestamp has already been committed. To implement that, the transaction manager has to maintain max_read_timestamp and max_write_timestamp per value, describing read and write operations executed by concurrent transactions.

如果读取操作尝试读取时间戳低于 max_write_timestamp 的值,它所属的事务会被中止,因为已经存在更新的值,允许此操作将违反事务顺序。

Read operations that attempt to read a value with a timestamp lower than max_write_timestamp cause the transaction they belong to be aborted, since there’s already a newer value, and allowing this operation would violate the transaction order.

同样,时间戳低于 max_read_timestamp 的写入操作会与较新的读取发生冲突。但是,时间戳低于 max_write_timestamp 的写入操作是允许的,因为我们可以安全地忽略过时的写入值。这一猜想通常称为托马斯写规则(Thomas Write Rule)[THOMAS79]。每当执行读或写操作时,相应的最大时间戳值就会更新。中止的事务会以新的时间戳重新启动,否则它们肯定会再次被中止[RAMAKRISHNAN03]。

Similarly, write operations with a timestamp lower than max_read_timestamp would conflict with a more recent read. However, write operations with a timestamp lower than max_write_timestamp are allowed, since we can safely ignore the outdated written values. This conjecture is commonly called the Thomas Write Rule [THOMAS79]. As soon as read or write operations are performed, the corresponding maximum timestamp values are updated. Aborted transactions restart with a new timestamp, since otherwise they’re guaranteed to be aborted again [RAMAKRISHNAN03].
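
上述规则可以用如下草图来示意(假想的简化实现,省略了持久化和真正的并发):

The rules above can be sketched as follows (a simplified, hypothetical implementation; persistence and real concurrency are omitted):

```python
class Aborted(Exception):
    """The transaction must restart with a new timestamp."""


class TimestampOrdering:
    """Toy per-value bookkeeping of max_read/max_write timestamps."""

    def __init__(self):
        self.data = {}       # key -> value
        self.max_read = {}   # key -> max_read_timestamp
        self.max_write = {}  # key -> max_write_timestamp

    def read(self, ts, key):
        if ts < self.max_write.get(key, 0):
            raise Aborted("a newer value already exists")
        self.max_read[key] = max(self.max_read.get(key, 0), ts)
        return self.data.get(key)

    def write(self, ts, key, value):
        if ts < self.max_read.get(key, 0):
            raise Aborted("conflicts with a more recent read")
        if ts < self.max_write.get(key, 0):
            return  # Thomas Write Rule: the outdated write is ignored
        self.max_write[key] = ts
        self.data[key] = value
```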

基于锁的并发控制

Lock-Based Concurrency Control

基于锁的并发控制方案是悲观并发控制的一种形式,它在数据库对象上使用显式锁定,而不是像时间戳排序等协议那样解析调度。使用锁的一些缺点是争用和可扩展性问题[REN16]

Lock-based concurrency control schemes are a form of pessimistic concurrency control that uses explicit locks on the database objects rather than resolving schedules, like protocols such as timestamp ordering do. Some of the downsides of using locks are contention and scalability issues [REN16].

使用最广泛的基于锁的技术之一是两阶段锁定(2PL),它将锁管理分为两个阶段:

One of the most widespread lock-based techniques is two-phase locking (2PL), which separates lock management into two phases:

  • 增长阶段(也称为扩展阶段),在此期间获取事务所需的所有锁,并且不释放任何锁。

  • The growing phase (also called the expanding phase), during which all locks required by the transaction are acquired and no locks are released.

  • 收缩阶段,在此期间释放在增长阶段获取的所有锁。

  • The shrinking phase, during which all locks acquired during the growing phase are released.

从这两个定义可以得出一条规则:事务一旦释放了至少一个锁,就不能再获取任何锁。值得注意的是,2PL 并不禁止事务在这两个阶段期间执行操作步骤;不过,某些 2PL 变体(例如保守 2PL)确实施加了这样的限制。

A rule that follows from these two definitions is that a transaction cannot acquire any locks as soon as it has released at least one of them. It’s important to note that 2PL does not preclude transactions from executing steps during either one of these phases; however, some 2PL variants (such as conservative 2PL) do impose these limitations.
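
这两个阶段可以示意如下(假想的单线程草图,冲突的获取直接抛出异常而不是阻塞等待):

The two phases can be sketched as follows (a hypothetical, single-threaded sketch; a conflicting acquisition raises instead of blocking):

```python
class TwoPhaseLockingViolation(Exception):
    """Raised when a transaction acquires after its first release."""


class Transaction2PL:
    """Toy 2PL transaction over a shared lock table (object -> owner)."""

    def __init__(self, lock_table):
        self.lock_table = lock_table
        self.held = set()
        self.shrinking = False  # flips on the first release

    def acquire(self, obj):
        if self.shrinking:
            # 2PL rule: no lock may be acquired once any lock was released.
            raise TwoPhaseLockingViolation("already in shrinking phase")
        if self.lock_table.get(obj) not in (None, self):
            raise RuntimeError("held by another transaction: wait or abort")
        self.lock_table[obj] = self
        self.held.add(obj)

    def release(self, obj):
        self.shrinking = True  # the first release starts the shrinking phase
        self.held.discard(obj)
        del self.lock_table[obj]
```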

警告

尽管名称相似,但两阶段锁定是与两阶段提交(参见“两阶段提交”)完全不同的概念。两阶段提交是一种用于分布式多分区事务的协议,而两阶段锁定是一种常用于实现可串行化的并发控制机制。

Despite similar names, two-phase locking is a concept that is entirely different from two-phase commit (see “Two-Phase Commit”). Two-phase commit is a protocol used for distributed multipartition transactions, while two-phase locking is a concurrency control mechanism often used to implement serializability.

死锁

Deadlocks

在锁定协议中,事务尝试获取数据库对象上的锁,如果不能立即授予锁,则事务必须等待锁被释放。当两个事务尝试获取继续执行所需的锁时,可能会出现一种情况,最终等待对方释放它们持有的其他锁。这种情况称为死锁

In locking protocols, transactions attempt to acquire locks on the database objects and, in case a lock cannot be granted immediately, a transaction has to wait until the lock is released. A situation may occur when two transactions, while attempting to acquire locks they require in order to proceed with execution, end up waiting for each other to release the other locks they hold. This situation is called a deadlock.

图 5-6 显示了死锁的示例:T1 持有锁 L1 并等待锁 L2 被释放,同时 T2 持有锁 L2 并等待 L1 被释放。

Figure 5-6 shows an example of a deadlock: T1 holds lock L1 and waits for lock L2 to be released, while T2 holds lock L2 and waits for L1 to be released.

数据库0506
图 5-6。死锁示例

处理死锁的最简单方法是引入超时并中止长时间运行的事务(假设它们可能处于死锁状态)。另一种策略是保守 2PL,它要求事务在执行任何操作之前先获取所有锁,如果无法获取则中止。然而,这些方法极大地限制了系统并发性,因此数据库系统大多使用事务管理器来检测或避免(换句话说,防止)死锁。

The simplest way to handle deadlocks is to introduce timeouts and abort long-running transactions under the assumption that they might be in a deadlock. Another strategy, conservative 2PL, requires transactions to acquire all the locks before they can execute any of their operations and abort if they cannot. However, these approaches significantly limit system concurrency, and database systems mostly use a transaction manager to detect or avoid (in other words, prevent) deadlocks.

检测死锁通常使用等待图来完成,该图跟踪正在进行的事务之间的关系并在它们之间建立等待关系。

Detecting deadlocks is generally done using a waits-for graph, which tracks relationships between the in-flight transactions and establishes waits-for relationships between them.

该图中的环表明存在死锁:事务 T1 正在等待 T2,而 T2 又在等待 T1。死锁检测可以定期(每个时间间隔一次)或连续(每次更新等待图时)进行[WEIKUM01]。其中一个事务(通常是最近尝试获取锁的那个)会被中止。

Cycles in this graph indicate the presence of a deadlock: transaction T1 is waiting for T2 which, in turn, waits for T1. Deadlock detection can be done periodically (once per time interval) or continuously (every time the waits-for graph is updated) [WEIKUM01]. One of the transactions (usually, the one that attempted to acquire the lock more recently) is aborted.
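
等待图中的环检测本质上就是有向图的环检测,可以用深度优先搜索来示意(假想的简化草图):

Cycle detection in a waits-for graph is just cycle detection in a directed graph, which can be sketched with a depth-first search (a simplified, hypothetical sketch):

```python
def find_deadlock(waits_for):
    """Detect a cycle in a waits-for graph: {txn: set of txns it waits for}."""
    visited, on_stack = set(), set()

    def dfs(txn):
        visited.add(txn)
        on_stack.add(txn)
        for other in waits_for.get(txn, ()):
            if other in on_stack:
                return True  # back edge: a cycle, hence a deadlock
            if other not in visited and dfs(other):
                return True
        on_stack.discard(txn)
        return False

    return any(dfs(t) for t in list(waits_for) if t not in visited)
```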

为了避免死锁并将锁获取限制在不会导致死锁的情况下,事务管理器可以使用事务时间戳来确定其优先级。较低的时间戳通常意味着较高的优先级,反之亦然。

To avoid deadlocks and restrict lock acquisition to cases that will not result in a deadlock, the transaction manager can use transaction timestamps to determine their priority. A lower timestamp usually implies higher priority and vice versa.

如果事务 T1 尝试获取当前由 T2 持有的锁,并且 T1 具有更高的优先级(它在 T2 之前开始),我们可以使用以下限制之一来避免死锁[RAMAKRISHNAN03]:

If transaction T1 attempts to acquire a lock currently held by T2, and T1 has higher priority (it started before T2), we can use one of the following restrictions to avoid deadlocks [RAMAKRISHNAN03]:

等待-死亡(Wait-die)
Wait-die

T1 被允许阻塞并等待锁。否则,T1 被中止并重新启动。换句话说,事务只能被时间戳更大的事务阻塞。

T1 is allowed to block and wait for the lock. Otherwise, T1 is aborted and restarted. In other words, a transaction can be blocked only by a transaction with a higher timestamp.

杀伤-等待(Wound-wait)
Wound-wait

T2 被中止并重新启动(T1 “杀伤”了 T2)。否则(如果 T2 在 T1 之前开始),T1 被允许等待。换句话说,事务只能被时间戳更小的事务阻塞。

T2 is aborted and restarted (T1 wounds T2). Otherwise (if T2 has started before T1), T1 is allowed to wait. In other words, a transaction can be blocked only by a transaction with a lower timestamp.
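
这两种策略可以归结为基于时间戳的简单判定(示意性的假想函数,返回的字符串仅用于说明):

Both policies boil down to a simple timestamp-based decision (illustrative, hypothetical functions; the returned strings are for demonstration only):

```python
def wait_die(requester_ts, holder_ts):
    """Wait-die: an older (smaller-timestamp) requester may wait;
    a younger requester is aborted ('dies') and restarted."""
    return "wait" if requester_ts < holder_ts else "abort requester"


def wound_wait(requester_ts, holder_ts):
    """Wound-wait: an older requester aborts ('wounds') the younger holder;
    a younger requester is allowed to wait."""
    return "abort holder" if requester_ts < holder_ts else "wait"
```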

事务处理需要调度程序来处理死锁。同时,锁存器(参见“锁存器”)依赖于程序员来确保死锁不会发生,而不依赖于死锁避免机制。

Transaction processing requires a scheduler to handle deadlocks. At the same time, latches (see “Latches”) rely on the programmer to ensure that deadlocks cannot happen and do not rely on deadlock avoidance mechanisms.

锁

Locks

如果两个事务并发提交并修改重叠的数据段,那么任何一个事务都不应观察到另一个事务的部分结果,从而保持逻辑一致性。类似地,同一事务的两个线程必须观察到相同的数据库内容,并且可以访问彼此的数据。

If two transactions are submitted concurrently, modifying overlapping segments of data, neither one of them should observe partial results of the other one, hence maintaining logical consistency. Similarly, two threads from the same transaction have to observe the same database contents, and have access to each other’s data.

在事务处理中,保护逻辑数据完整性和物理数据完整性的机制是有区别的。相应地,负责逻辑完整性和物理完整性的两个概念分别是锁(lock)和闩锁(latch)。这个命名有点不幸,因为这里所谓的闩锁在系统编程中通常被称为锁,但我们将在本节中阐明两者的区别和含义。

In transaction processing, there’s a distinction between the mechanisms that guard the logical and physical data integrity. The two concepts responsible for logical and physical integrity are, correspondingly, locks and latches. The naming is somewhat unfortunate since what’s called a latch here is usually referred to as a lock in systems programming, but we’ll clarify the distinction and implications in this section.

锁用于隔离和调度重叠的事务并管理数据库内容(而非内部存储结构),并且是在键(key)上获取的。锁可以保护特定的键(无论它存在与否)或一个键范围。锁通常在树实现之外存储和管理,代表一个更高级别的概念,由数据库锁管理器管理。

Locks are used to isolate and schedule overlapping transactions and manage database contents but not the internal storage structure, and are acquired on the key. Locks can guard either a specific key (whether it’s existing or nonexisting) or a range of keys. Locks are generally stored and managed outside of the tree implementation and represent a higher-level concept, managed by the database lock manager.

锁比闩锁更重量级,并且在整个事务期间一直持有。

Locks are more heavyweight than latches and are held for the duration of the transaction.

闩锁

Latches

另一方面,闩锁保护的是物理表示:叶页内容在插入、更新和删除操作期间被修改;非叶页内容和树结构在由叶节点下溢和上溢引发并向上传播的分裂与合并操作期间被修改。闩锁在这些操作期间保护物理树表示(页内容和树结构),并且是在页级别获取的。任何页都必须先被加上闩锁,才能对其进行安全的并发访问。即使是无锁并发控制技术也仍然需要使用闩锁。

On the other hand, latches guard the physical representation: leaf page contents are modified during insert, update, and delete operations. Nonleaf page contents and a tree structure are modified during operations resulting in splits and merges that propagate from leaf under- and overflows. Latches guard the physical tree representation (page contents and the tree structure) during these operations and are obtained on the page level. Any page has to be latched to allow safe concurrent access to it. Lockless concurrency control techniques still have to use latches.

由于叶级别上的单个修改可能会传播到 B 树的更高级别,因此可能必须在多个级别上获取闩锁。正在执行的查询不应观察到处于不一致状态的页面,例如不完整的写入或部分完成的节点分裂,在此期间数据可能同时存在于源节点和目标节点中,或者尚未传播到父节点。

Since a single modification on the leaf level might propagate to higher levels of the B-Tree, latches might have to be obtained on multiple levels. Executing queries should not be able to observe pages in an inconsistent state, such as incomplete writes or partial node splits, during which data might be present in both the source and target node, or not yet propagated to the parent.

相同的规则适用于父级或同级指针更新。一般规则是在尽可能短的时间内(即读取或更新页面时)保留锁存器,以增加并发性。

The same rules apply to parent or sibling pointer updates. A general rule is to hold a latch for the smallest possible duration—namely, when the page is read or updated—to increase concurrency.

并发操作之间的干扰大致可以分为三类:

Interferences between concurrent operations can be roughly grouped into three categories:

  • 并发读取,当多个线程访问同一页面而不修改它时。

  • Concurrent reads, when several threads access the same page without modifying it.

  • 并发更新,当多个线程尝试对同一页面进行修改时。

  • Concurrent updates, when several threads attempt to make modifications to the same page.

  • 当一个线程尝试修改页面内容,而另一个线程尝试访问同一页面进行读取时,边写边读。

  • Reading while writing, when one of the threads is trying to modify the page contents, and the other one is trying to access the same page for a read.

这些场景也适用于与数据库维护重叠的访问(例如后台进程,如“清理和维护”中所述)。

These scenarios also apply to accesses that overlap with database maintenance (such as background processes, as described in “Vacuum and Maintenance”).

读写锁

Readers-writer lock

最简单的闩锁实现将向请求线程授予独占读/写访问权限。然而,大多数时候,我们不需要将所有进程相互隔离。例如,读取可以并发访问页面而不会造成任何麻烦,因此我们只需要确保多个并发写入不重叠,并且读取器与写入者不重叠。为了实现这种粒度级别,我们可以使用读写锁或 RW 锁。

The simplest latch implementation would grant exclusive read/write access to the requesting thread. However, most of the time, we do not need to isolate all the processes from each other. For example, reads can access pages concurrently without causing any trouble, so we only need to make sure that multiple concurrent writers do not overlap, and readers do not overlap with writers. To achieve this level of granularity, we can use a readers-writer lock or RW lock.

RW 锁允许多个读取者同时访问该对象,并且只有写入者(通常数量较少)必须获得对该对象的独占访问权。图 5-7显示了读写锁的兼容性表:只有读取者可以共享锁所有权,而所有其他读取者和写入者组合都应获得独占所有权。

An RW lock allows multiple readers to access the object concurrently, and only writers (which we usually have fewer of) have to obtain exclusive access to the object. Figure 5-7 shows the compatibility table for readers-writer locks: only readers can share lock ownership, while all other combinations of readers and writers should obtain exclusive ownership.

数据库0507
图 5-7。读写锁兼容性表

图 5-8 (a) 中,我们有多个读取器访问该对象,而写入器正在等待轮到它,因为当读取器访问页面时它无法修改页面。在图5-8 (b)中,writer 1持有对象的独占锁,而另一个写入者和三个读取者必须等待。

In Figure 5-8 (a), we have multiple readers accessing the object, while the writer is waiting for its turn, since it can’t modify the page while readers access it. In Figure 5-8 (b), writer 1 holds an exclusive lock on the object, while another writer and three readers have to wait.

数据库0508
图 5-8。读写锁

尝试访问同一页面的两个重叠读取,除了防止页面缓存从磁盘两次获取该页面之外,不需要其他同步,因此读取可以在共享模式下安全地并发执行。而一旦涉及写入,我们就需要将它们与并发读取和其他写入隔离开来。

Since two overlapping reads attempting to access the same page do not require synchronization other than preventing the page from being fetched from disk by the page cache twice, reads can be safely executed concurrently in shared mode. As soon as writes come into play, we need to isolate them from both concurrent reads and other writes.
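
下面是一个基于条件变量的极简读写锁草图(假想的简化实现;真实引擎的闩锁更精细,例如会处理写者饥饿问题):

Below is a minimal readers-writer lock sketch built on a condition variable (a simplified, hypothetical implementation; real latches are more elaborate, e.g., they address writer starvation):

```python
import threading


class ReadersWriterLock:
    """Readers share ownership; writers require exclusive ownership."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_read(self):
        with self._cond:
            while self._writer:  # readers only wait for an active writer
                self._cond.wait()
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def acquire_write(self):
        with self._cond:
            while self._writer or self._readers:  # writers need exclusivity
                self._cond.wait()
            self._writer = True

    def release_write(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```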

闩锁爬行

Latch crabbing

获取闩锁最直接的方法是获取从根到目标叶路径上的所有闩锁。这会产生并发瓶颈,并且在大多数情况下是可以避免的。持有闩锁的时间应尽量缩短。可用于实现这一目标的优化之一称为闩锁爬行(latch crabbing,也称闩锁耦合 latch coupling)[RAMAKRISHNAN03]。

The most straightforward approach for latch acquisition is to grab all the latches on the way from the root to the target leaf. This creates a concurrency bottleneck and can be avoided in most cases. The time during which a latch is held should be minimized. One of the optimizations that can be used to achieve that is called latch crabbing (or latch coupling) [RAMAKRISHNAN03].

闩锁爬行是一种相当简单的方法,它可以缩短持有闩锁的时间,并在确定正在执行的操作不再需要某些闩锁时立即释放它们。在读路径上,一旦定位到子节点并获取了它的闩锁,就可以释放父节点的闩锁。

Latch crabbing is a rather simple method that allows holding latches for less time and releasing them as soon as it’s clear that the executing operation does not require them anymore. On the read path, as soon as the child node is located and its latch is acquired, the parent node’s latch can be released.

在插入期间,如果能保证该操作不会导致向上传播的结构更改,就可以释放父闩锁。换句话说,如果子节点未满,则可以释放父闩锁。

During insert, the parent latch can be released if the operation is guaranteed not to result in structural changes that can propagate to it. In other words, the parent latch can be released if the child node is not full.

类似地,在删除过程中,如果子节点持有足够多的元素,且该操作不会导致兄弟节点合并,则可以释放父节点上的闩锁。

Similarly, during deletes, if the child node holds enough elements and the operation will not cause sibling nodes to merge, the latch on the parent node is released.

图 5-9显示了插入期间从根到叶的传递:

Figure 5-9 shows a root-to-leaf pass during insert:

  • a) 在根级别获取写锁存器。

  • a) The write latch is acquired on the root level.

  • b) 定位下一级节点,并获取其写锁存器。检查节点是否存在潜在的结构变化。由于节点未满,因此可以释放父锁存器。

  • b) The next-level node is located, and its write latch is acquired. The node is checked for potential structural changes. Since the node is not full, the parent latch can be released.

  • c) 操作下降到下一个级别。获取写锁存器,检查目标叶节点是否有潜在的结构变化,并释放父锁存器。

  • c) The operation descends to the next level. The write latch is acquired, the target leaf node is checked for potential structural changes, and the parent latch is released.

这种方法是乐观的:大多数插入和删除操作不会导致向上传播多个级别的结构变化。事实上,级别越高,发生结构变化的可能性越低。大多数操作只需要目标节点上的闩锁,需要保留父闩锁的情况相对较少。

This approach is optimistic: most insert and delete operations do not cause structural changes that propagate multiple levels up. In fact, the probability of structural changes decreases at higher levels. Most of the operations only require the latch on the target node, and the number of cases when the parent latch has to be retained is relatively small.
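
插入路径上的闩锁爬行可以示意如下(假想的简化草图:用布尔标志代表闩锁,省略了真正的并发和节点分裂):

Latch crabbing on the insert path can be sketched as follows (a simplified, hypothetical sketch: boolean flags stand in for latches; real concurrency and node splits are omitted):

```python
class Node:
    """A toy B-Tree node; an empty children list means the node is a leaf."""

    def __init__(self, keys, children=None, capacity=4):
        self.keys = keys
        self.children = children or []
        self.capacity = capacity
        self.latched = False

    def is_safe_for_insert(self):
        # "Safe": an insert below cannot split this node.
        return len(self.keys) < self.capacity


def insert_descend(root, key):
    """Descend root-to-leaf with latch crabbing: ancestor latches are
    released as soon as the child is known to be safe (not full)."""
    node = root
    node.latched = True
    held = [node]  # latches currently held, in root-to-leaf order
    while node.children:
        idx = sum(1 for k in node.keys if key >= k)  # choose the child
        child = node.children[idx]
        child.latched = True
        held.append(child)
        if child.is_safe_for_insert():
            # No split can propagate upward: release all ancestor latches.
            for ancestor in held[:-1]:
                ancestor.latched = False
            held = [child]
        node = child
    return node, held
```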

如果子页面尚未加载到页面缓存中,我们可以为即将加载的页面加上闩锁,或者释放父闩锁并在页面加载后重新开始根到叶的遍历,以减少争用。重新开始根到叶的遍历听起来相当昂贵,但实际上我们很少需要这样做,并且可以采用机制来检测自遍历开始以来较高级别是否发生过任何结构变化[GRAEFE10]。

If the child page is still not loaded in the page cache, we can either latch a future loading page, or release a parent latch and restart the root-to-leaf pass after the page is loaded to reduce contention. Restarting root-to-leaf traversal sounds rather expensive, but in reality, we have to perform it rather infrequently, and can employ mechanisms to detect whether or not there were any structural changes at higher levels since the time of traversal [GRAEFE10].

数据库0509
图 5-9。插入期间的闩锁爬行

小结

Summary

在本章中,我们讨论了负责事务处理和恢复的存储引擎组件。在实现事务处理时,我们遇到两个问题:

In this chapter, we discussed the storage engine components responsible for transaction processing and recovery. When implementing transaction processing, we are presented with two problems:

  • 为了提高效率,我们需要允许事务并发执行。

  • To improve efficiency, we need to allow concurrent transaction execution.

  • 为了保持正确性,我们必须确保并发执行的事务保持 ACID 属性。

  • To preserve correctness, we have to ensure that concurrently executing transactions preserve ACID properties.

并发事务执行可能会导致不同类型的读写异常。通过实施不同的隔离级别来描述和限制这些异常的存在或不存在。并发控制方法决定事务的调度和执行方式。

Concurrent transaction execution can cause different kinds of read and write anomalies. Presence or absence of these anomalies is described and limited by implementing different isolation levels. Concurrency control approaches determine how transactions are scheduled and executed.

页缓存负责减少磁盘访问次数:它将页缓存在内存中并允许对它们进行读写访问。当缓存达到其容量时,页面将被逐出并刷新回磁盘。为了确保未刷新的更改在节点崩溃时不会丢失并支持事务回滚,我们使用预写日志。页面缓存和预写日志使用强制和窃取策略进行协调,确保每个事务都可以高效执行并回滚,而不会牺牲耐用性。

The page cache is responsible for reducing the number of disk accesses: it caches pages in memory and allows read and write access to them. When the cache reaches its capacity, pages are evicted and flushed back on disk. To make sure that unflushed changes are not lost in case of node crashes and to support transaction rollback, we use write-ahead logs. The page cache and write-ahead logs are coordinated using force and steal policies, ensuring that every transaction can be executed efficiently and rolled back without sacrificing durability.

第 6 章 B 树变体

Chapter 6. B-Tree Variants

B 树变体有一些共同点:树结构、通过拆分和合并进行平衡以及查找和删除算法。与并发性、磁盘上页面表示、兄弟节点之间的链接以及维护过程相关的其他细节可能因实现而异。

B-Tree variants have a few things in common: tree structure, balancing through splits and merges, and lookup and delete algorithms. Other details, related to concurrency, on-disk page representation, links between sibling nodes, and maintenance processes, may vary between implementations.

在本章中,我们将讨论可用于实现高效 B 树和使用它们的结构的几种技术:

In this chapter, we’ll discuss several techniques that can be used to implement efficient B-Trees and structures that employ them:

  • 写时复制 B 树的结构与 B 树类似,但它们的节点是不可变的并且不会就地更新。相反,页面会被复制、更新并写入新位置。

  • Copy-on-write B-Trees are structured like B-Trees, but their nodes are immutable and are not updated in place. Instead, pages are copied, updated, and written to new locations.

  • 惰性 B 树通过缓冲节点更新来减少后续同一节点写入的 I/O 请求数量。在下一章中,我们还将介绍双组件 LSM 树(请参阅“双组件 LSM 树”),它进一步采用缓冲来实现完全不可变的 B 树。

  • Lazy B-Trees reduce the number of I/O requests from subsequent same-node writes by buffering updates to nodes. In the next chapter, we also cover two-component LSM trees (see “Two-component LSM Tree”), which take buffering a step further to implement fully immutable B-Trees.

  • FD-Tree 采用不同的缓冲方法,有点类似于 LSM 树(请参阅“LSM 树”)。FD-Tree 在一棵小型 B 树中缓冲更新。一旦这棵树被填满,它的内容就会被写入一个不可变的有序段(run)。更新以级联方式在不可变段的各级别之间传播,从较高级别到较低级别。

  • FD-Trees take a different approach to buffering, somewhat similar to LSM Trees (see “LSM Trees”). FD-Trees buffer updates in a small B-Tree. As soon as this tree fills up, its contents are written into an immutable run. Updates propagate between levels of immutable runs in a cascading manner, from higher levels to lower ones.

  • Bw-Tree将 B-Tree 节点分成几个较小的部分,这些部分以仅追加的方式编写。通过将不同节点的更新一起批量化,可以降低小规模写入的成本。

  • Bw-Trees separate B-Tree nodes into several smaller parts that are written in an append-only manner. This reduces costs of small writes by batching updates to the different nodes together.

  • 缓存无关(cache-oblivious)B 树允许以与构建内存数据结构非常相似的方式处理磁盘数据结构。

  • Cache-oblivious B-Trees allow treating on-disk data structures in a way that is very similar to how we build in-memory ones.

写时复制

Copy-on-Write

一些数据库不构建复杂的锁存机制,而是使用写时复制技术来保证并发操作时数据的完整性。在这种情况下,每当要修改页面时,都会复制其内容,修改复制的页面而不是原始页面,并创建并行树层次结构。

Some databases, rather than building complex latching mechanisms, use the copy-on-write technique to guarantee data integrity in the presence of concurrent operations. In this case, whenever the page is about to be modified, its contents are copied, the copied page is modified instead of the original one, and a parallel tree hierarchy is created.

与写入器并发运行的读取器仍然可以访问旧的树版本,而访问被修改页面的写入器必须等到前面的写入操作完成。创建新的页面层次结构后,指向最顶层页面的指针会被原子地更新。在图 6-1 中,您可以看到一棵与旧树并行创建的新树,它重用了未被触及的页面。

Old tree versions remain accessible for readers that run concurrently to the writer, while writers accessing modified pages have to wait until preceding write operations are complete. After the new page hierarchy is created, the pointer to the topmost page is atomically updated. In Figure 6-1, you can see a new tree being created parallel to the old one, reusing the untouched pages.

数据库0601
图 6-1。写时复制 B 树

这种方法的一个明显缺点是它需要更多的空间(即使旧版本仅保留很短的时间,因为使用旧页面的并发操作完成后可以立即回收页面)和处理器时间,因为整个页面内容必须被复制。由于 B 树通常很浅,因此这种方法的简单性和优点通常仍然大于缺点。

An obvious downside of this approach is that it requires more space (even though old versions are retained only for brief time periods, since pages can be reclaimed immediately after concurrent operations using the old pages complete) and processor time, as entire page contents have to be copied. Since B-Trees are generally shallow, the simplicity and advantages of this approach often still outweigh the downsides.

这种方法的最大优点是读取器不需要同步,因为写入的页面是不可变的并且可以在没有额外锁存的情况下进行访问。由于写入是针对复制的页面执行的,因此读取器不会阻止写入器。任何操作都无法观察处于不完整状态的页面,并且系统崩溃不会使页面处于损坏状态,因为仅当所有页面修改完成时才会切换最顶层指针。

The biggest advantage of this approach is that readers require no synchronization, because written pages are immutable and can be accessed without additional latching. Since writes are performed against copied pages, readers do not block writers. No operation can observe a page in an incomplete state, and a system crash cannot leave pages in a corrupted state, since the topmost pointer is switched only when all page modifications are done.
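
路径复制的核心可以示意如下(假想的简化草图,省略了节点分裂和空间回收):

The core of path copying can be sketched as follows (a simplified, hypothetical sketch; node splits and page reclamation are omitted):

```python
class CowNode:
    """Toy copy-on-write B-Tree node; values is None for internal nodes."""

    def __init__(self, keys, values=None, children=None):
        self.keys = keys
        self.values = values
        self.children = children


def cow_insert(node, key, value):
    """Copy every node on the root-to-leaf path and return a new root;
    untouched subtrees are shared between the old and the new tree."""
    if node.children is None:  # leaf
        keys, values = list(node.keys), list(node.values)
        idx = sum(1 for k in keys if k < key)
        keys.insert(idx, key)
        values.insert(idx, value)
        return CowNode(keys, values)  # splits are omitted in this sketch
    idx = sum(1 for k in node.keys if key >= k)
    children = list(node.children)  # shallow copy: siblings are shared
    children[idx] = cow_insert(children[idx], key, value)
    return CowNode(list(node.keys), children=children)
```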

实现写时复制:LMDB

Implementing Copy-on-Write: LMDB

使用写时复制的存储引擎之一是 Lightning 内存映射数据库 ( LMDB ),它是 OpenLDAP 项目使用的键值存储。由于其设计和架构,LMDB 不需要页面缓存、预写日志、检查点或压缩。1

One of the storage engines using copy-on-write is the Lightning Memory-Mapped Database (LMDB), which is a key-value store used by the OpenLDAP project. Due to its design and architecture, LMDB doesn’t require a page cache, a write-ahead log, checkpointing, or compaction.1

LMDB 实现为单级数据存储,这意味着读写操作直接通过内存映射来满足,中间无需额外的应用程序级缓存。这也意味着页面不需要额外的物化,读取可以直接从内存映射提供服务,而无需将数据复制到中间缓冲区。在更新期间,从根到目标叶路径上的每个分支节点都会被复制并可能被修改:更新传播到的节点被更改,其余节点保持不变。

LMDB is implemented as a single-level data store, which means that read and write operations are satisfied directly through the memory map, without additional application-level caching in between. This also means that pages require no additional materialization and reads can be served directly from the memory map without copying data to the intermediate buffer. During the update, every branch node on the path from the root to the target leaf is copied and potentially modified: nodes for which updates propagate are changed, and the rest of the nodes remain intact.

LMDB 仅保存根节点的两个版本:最新版本,以及将要提交新更改的版本。这已经足够,因为所有写入都必须经过根节点。创建新根后,旧根将不再接受新的读写。一旦引用旧树部分的读取完成,它们的页面就会被回收并可以重用。由于 LMDB 的仅追加设计,它不使用兄弟指针,并且在顺序扫描期间必须回升到父节点。

LMDB holds only two versions of the root node: the latest version, and the one where new changes are going to be committed. This is sufficient since all writes have to go through the root node. After the new root is created, the old one becomes unavailable for new reads and writes. As soon as the reads referencing old tree sections complete, their pages are reclaimed and can be reused. Because of LMDB’s append-only design, it does not use sibling pointers and has to ascend back to the parent node during sequential scans.

通过这种设计,在复制的节点中保留过时的数据是不切实际的:已经有一个副本可用于 MVCC 并满足正在进行的读取事务。数据库结构本质上是多版本的,读取器可以在没有任何锁的情况下运行,因为它们不会以任何方式干扰写入器。

With this design, leaving stale data in copied nodes is impractical: there is already a copy that can be used for MVCC and satisfy ongoing read transactions. The database structure is inherently multiversioned, and readers can run without any locks as they do not interfere with writers in any way.

抽象节点更新

Abstracting Node Updates

无论以何种方式更新磁盘上的页面,我们都必须首先更新它在内存中的表示。然而,在内存中表示节点有几种方式:我们可以直接访问节点的缓存版本,通过包装器对象进行访问,或者创建实现语言原生的内存表示。

To update the page on disk, one way or the other, we have to first update its in-memory representation. However, there are a few ways to represent a node in memory: we can access the cached version of the node directly, do it through the wrapper object, or create its in-memory representation native to the implementation language.

在具有非托管内存模型的语言中,可以重新解释存储在 B 树节点中的原始二进制数据,并且可以使用本机指针来操作它。在这种情况下,节点是根据结构定义的,该结构在指针和运行时强制转换后面使用原始二进制数据。大多数情况下,它们指向由页面缓存管理的内存区域或使用内存映射。

In languages with an unmanaged memory model, raw binary data stored in B-Tree nodes can be reinterpreted and native pointers can be used to manipulate it. In this case, the node is defined in terms of structures, which use raw binary data behind the pointer and runtime casts. Most often, they point to the memory area managed by the page cache or use memory mapping.

或者,B 树节点可以物化为该语言原生的对象或结构。这些结构可用于插入、更新和删除。在刷新期间,更改先被应用到内存中的页面,随后再应用到磁盘上。这种方法的优点是简化了并发访问,因为对底层原始页面的更改与对中间对象的访问是分开管理的;但它会带来更高的内存开销,因为我们必须在内存中存储同一页面的两个版本(原始二进制版本和语言原生版本)。

Alternatively, B-Tree nodes can be materialized into objects or structures native to the language. These structures can be used for inserts, updates, and deletes. During flush, changes are applied to pages in memory and, subsequently, on disk. This approach has the advantage of simplifying concurrent accesses since changes to underlying raw pages are managed separately from accesses to intermediate objects, but results in a higher memory overhead, since we have to store two versions (raw binary and language-native) of the same page in memory.

第三种方法是通过包装器对象提供对支持节点的缓冲区的访问,包装器对象在执行更改后立即在 B 树中实现更改。这种方法最常用于具有托管内存模型的语言。包装对象将更改应用于后备缓冲区。

The third approach is to provide access to the buffer backing the node through the wrapper object that materializes changes in the B-Tree as soon as they’re performed. This approach is most often used in languages with a managed memory model. Wrapper objects apply the changes to the backing buffers.

分别管理磁盘上的页面、它们的缓存版本以及它们在内存中的表示,允许它们具有不同的生命周期。例如,我们可以缓冲插入、更新和删除操作,并在读取期间将内存中所做的更改与磁盘上的原始版本进行协调。

Managing on-disk pages, their cached versions, and their in-memory representations separately allows them to have different life cycles. For example, we can buffer insert, update, and delete operations, and reconcile changes made in memory with the original on-disk versions during reads.

惰性 B 树

Lazy B-Trees

一些算法(在本书的范围内,我们称之为惰性 B 树2)降低了更新 B 树的成本,并使用更轻量级、并发性和更新友好的内存结构来缓冲更新并延迟传播它们。

Some algorithms (in the scope of this book, we call them lazy B-Trees2) reduce costs of updating the B-Tree and use more lightweight, concurrency- and update-friendly in-memory structures to buffer updates and propagate them with a delay.

WiredTiger

WiredTiger

让我们看看如何使用缓冲来实现惰性 B 树。为此,我们可以在 B 树节点被调入内存时立即将其物化,并使用该结构来存储更新,直到我们准备好刷新它们为止。

Let’s take a look at how we can use buffering to implement a lazy B-Tree. For that, we can materialize B-Tree nodes in memory as soon as they are paged in and use this structure to store updates until we’re ready to flush them.

WiredTiger(如今 MongoDB 的默认存储引擎)使用了类似的方法。它的行存储 B 树实现对内存中和磁盘上的页面使用不同的格式。在内存页被持久化之前,它们必须经历协调(reconciliation)过程。

A similar approach is used by WiredTiger, a now-default MongoDB storage engine. Its row store B-Tree implementation uses different formats for in-memory and on-disk pages. Before in-memory pages are persisted, they have to go through the reconciliation process.

图 6-2中,您可以看到 WiredTiger 页面及其在 B 树中的组成的示意图。A 干净页仅包含一个索引,该索引最初是根据磁盘上的页映像构建的。更新首先保存到更新缓冲区中。

In Figure 6-2, you can see a schematic representation of WiredTiger pages and their composition in a B-Tree. A clean page consists of just an index, initially constructed from the on-disk page image. Updates are first saved into the update buffer.

数据库0602
图 6-2。WiredTiger:高级概述

更新缓冲区在读取期间被访问:它们的内容与原始磁盘页面内容合并,以返回最新的数据。刷新页面时,更新缓冲区的内容与页面内容进行协调并持久化到磁盘上,覆盖原始页面。如果协调后的页面大小超过最大值,则会将其拆分为多个页面。更新缓冲区使用跳表(skiplist)实现,其复杂度与搜索树类似[PAPADAKIS93],但具有更好的并发特性[PUGH90a]。

Update buffers are accessed during reads: their contents are merged with the original on-disk page contents to return the most recent data. When the page is flushed, update buffer contents are reconciled with page contents and persisted on disk, overwriting the original page. If the size of the reconciled page is greater than the maximum, it is split into multiple pages. Update buffers are implemented using skiplists, which have a complexity similar to search trees [PAPADAKIS93] but have a better concurrency profile [PUGH90a].
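
缓冲更新并在读取时合并的思路可以示意如下(假想的简化草图:用字典代替跳表,并非 WiredTiger 的真实 API):

Buffering updates and merging them on reads can be sketched as follows (a simplified, hypothetical sketch: a dict stands in for the skiplist; this is not WiredTiger's actual API):

```python
class LazyPage:
    """Toy page with an update buffer: reads merge the buffer with the
    on-disk image; reconciliation flushes the merged result back."""

    def __init__(self, disk_image):
        self.disk_image = dict(disk_image)  # stands in for the on-disk page
        self.update_buffer = {}             # a skiplist in a real engine

    def write(self, key, value):
        self.update_buffer[key] = value     # no disk I/O yet

    def read(self, key):
        # The update buffer overrides the stale on-disk value.
        if key in self.update_buffer:
            return self.update_buffer[key]
        return self.disk_image.get(key)

    def reconcile(self):
        self.disk_image.update(self.update_buffer)
        self.update_buffer.clear()
        return self.disk_image              # what would be persisted
```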

图 6-3显示 WiredTiger 中的干净页和脏页都有内存版本,并引用磁盘上的基础映像。除此之外,脏页还有一个更新缓冲区。

Figure 6-3 shows that both clean and dirty pages in WiredTiger have in-memory versions, and reference a base image on disk. Dirty pages have an update buffer in addition to that.

这里的主要优点是页面更新和结构修改(拆分和合并)由后台线程执行,读/写过程不必等待它们完成。

The main advantage here is that the page updates and structural modifications (splits and merges) are performed by the background thread, and read/write processes do not have to wait for them to complete.

数据库0603
图 6-3。WiredTiger 页面

惰性自适应树

Lazy-Adaptive Tree

与其缓冲对单个节点的更新,我们可以将节点分组为子树,并为每个子树附加一个用于批处理操作的更新缓冲区。在这种情况下,更新缓冲区将跟踪针对子树顶部节点及其后代执行的所有操作。该算法称为惰性自适应树(Lazy-Adaptive Tree,LA-Tree)[AGRAWAL09]。

Rather than buffering updates to individual nodes, we can group nodes into subtrees, and attach an update buffer for batching operations to each subtree. Update buffers in this case will track all operations performed against the subtree top node and its descendants. This algorithm is called Lazy-Adaptive Tree (LA-Tree) [AGRAWAL09].

当插入数据记录时,首先将新条目添加到根节点更新缓冲区中。当该缓冲区变满时,通过将更改复制并传播到较低树级别中的缓冲区来清空该缓冲区。如果较低层也填满,则该操作可以递归地继续,直到最终到达叶节点。

When inserting a data record, a new entry is first added to the root node update buffer. When this buffer becomes full, it is emptied by copying and propagating the changes to the buffers in the lower tree levels. This operation can continue recursively if the lower levels fill up as well, until it finally reaches the leaf nodes.

在图 6-4 中,您可以看到一棵 LA-Tree,其中按相应子树分组的节点带有级联缓冲区。灰色框表示从根缓冲区传播下来的更改。

In Figure 6-4, you see an LA-Tree with cascaded buffers for nodes grouped in corresponding subtrees. Gray boxes represent changes that propagated from the root buffer.

数据库0604
图 6-4。LA-Tree

缓冲区具有层次依赖性并且是级联的:所有更新都从较高级别的缓冲区传播到较低级别的缓冲区。当更新到达叶级别时,将在那里执行批量插入、更新和删除操作,立即将所有更改应用于树内容及其结构。页面可以在单次运行中更新,而不是单独对页面执行后续更新,从而需要更少的磁盘访问和结构更改,因为拆分和合并也会批量传播到更高级别。

Buffers have hierarchical dependencies and are cascaded: all the updates propagate from higher-level buffers to the lower-level ones. When the updates reach the leaf level, batched insert, update, and delete operations are performed there, applying all changes to the tree contents and its structure at once. Instead of performing subsequent updates on pages separately, pages can be updated in a single run, requiring fewer disk accesses and structural changes, since splits and merges propagate to the higher levels in batches as well.

这里描述的缓冲方法通过批处理写入操作来优化树更新时间,但方式略有不同。这两种算法都需要在内存缓冲结构中进行额外的查找,并与陈旧的磁盘数据进行合并/协调。

The buffering approaches described here optimize tree update time by batching write operations, but in slightly different ways. Both algorithms require additional lookups in in-memory buffering structures and merge/reconciliation with stale disk data.

FD树

FD-Trees

缓冲是数据库存储中广泛使用的想法之一:它有助于避免许多小的随机写入并执行单个较大的写入。在 HDD 上,由于磁头定位的原因,随机写入速度很慢。在 SSD 上,没有移动部件,但额外的写入 I/O 会带来额外的垃圾收集损失。

Buffering is one of the ideas that is widely used in database storage: it helps to avoid many small random writes and performs a single larger write instead. On HDDs, random writes are slow because of the head positioning. On SSDs, there are no moving parts, but the extra write I/O imposes an additional garbage collection penalty.

维护 B 树需要大量随机写入(叶级写入、拆分和合并传播到父级),但如果我们可以完全避免随机写入和节点更新呢?

Maintaining a B-Tree requires a lot of random writes—leaf-level writes, splits, and merges propagating to the parents—but what if we could avoid random writes and node updates altogether?

So far we’ve discussed buffering updates to individual nodes or groups of nodes by creating auxiliary buffers. An alternative approach is to group updates targeting different nodes together by using append-only storage and merge processes, an idea that has also inspired LSM Trees (see “LSM Trees”). This means that any write we perform does not require locating a target node for the write: all updates are simply appended. One of the examples of using this approach for indexing is called Flash Disk Tree (FD-Tree) [LI10].

An FD-Tree consists of a small mutable head tree and multiple immutable sorted runs. This approach limits the surface area, where random write I/O is required, to the head tree: a small B-Tree buffering the updates. As soon as the head tree fills up, its contents are transferred to the immutable run. If the size of the newly written run exceeds the threshold, its contents are merged with the next level, gradually propagating data records from upper to lower levels.

Fractional Cascading

To maintain pointers between the levels, FD-Trees use a technique called fractional cascading [CHAZELLE86]. This approach helps to reduce the cost of locating an item in the cascade of sorted arrays: you perform log n steps to find the searched item in the first array, but subsequent searches are significantly cheaper, since they start the search from the closest match from the previous level.

Shortcuts between the levels are made by building bridges between the neighbor-level arrays to minimize the gaps: element groups without pointers from higher levels. Bridges are built by pulling elements from lower levels to the higher ones, if they don’t already exist there, and pointing to the location of the pulled element in the lower-level array.

Since [CHAZELLE86] solves a search problem in computational geometry, it describes bidirectional bridges, and an algorithm for restoring the gap size invariant that we won’t be covering here. We describe only the parts that are applicable to database storage and FD-Trees in particular.

We could create a mapping from every element of the higher-level array to the closest element on the next level, but that would cause too much overhead for pointers and their maintenance. If we were to map only the items that already exist on a higher level, we could end up in a situation where the gaps between the elements are too large. To solve this problem, we pull every Nth item from the lower-level array to the higher one.

For example, if we have multiple sorted arrays:

A1 = [12, 24, 32, 34, 39]
A2 = [22, 25, 28, 30, 35]
A3 = [11, 16, 24, 26, 30]

We can bridge the gaps between elements by pulling every other element from the array with a higher index to the one with a lower index in order to simplify searches:

A1 = [12, 24, 25, 30, 32, 34, 39]
A2 = [16, 22, 25, 26, 28, 30, 35]
A3 = [11, 16, 24, 26, 30]

Now, we can use these pulled elements to create bridges (or fences as the FD-Tree paper calls them): pointers from higher-level elements to their counterparts on the lower levels, as Figure 6-5 shows.

图 6-5. Fractional cascading (Figure 6-5. Fractional cascading)

To search for elements in all these arrays, we perform a binary search on the highest level, and the search space on the next level is reduced significantly, since now we are forwarded to the approximate location of the searched item by following a bridge. This allows us to connect multiple sorted runs and reduce the costs of searching in them.
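
The bridged search over the example arrays above can be sketched as follows. This is an illustrative toy, not the algorithm from [CHAZELLE86]: the bridge tables encode which pulled elements point where, and the helper names are made up for the example.

```python
import bisect

# The example arrays after pulling, as shown above.
A1 = [12, 24, 25, 30, 32, 34, 39]
A2 = [16, 22, 25, 26, 28, 30, 35]
A3 = [11, 16, 24, 26, 30]
levels = [A1, A2, A3]

# bridges[level][i] maps a pulled element's index on `level` to the index
# of its counterpart one level down (positions are illustrative).
bridges = [
    {2: 2, 3: 5},        # 25 and 30 in A1 point into A2
    {0: 1, 3: 3, 5: 4},  # 16, 26, and 30 in A2 point into A3
]

def search(key):
    """Return (level, index) of key, or None if it is absent everywhere."""
    lo, hi = 0, len(levels[0])
    for level, arr in enumerate(levels):
        i = bisect.bisect_left(arr, key, lo, hi)
        if i < len(arr) and arr[i] == key:
            return (level, i)
        if level == len(levels) - 1:
            return None
        # Follow the nearest bridge at or after i; its target bounds the
        # search range on the next level. Fall back to the full range.
        bridge = next((j for j in range(i, len(arr)) if j in bridges[level]), None)
        lo = 0
        hi = len(levels[level + 1]) if bridge is None else bridges[level][bridge] + 1
    return None
```

Only the first lookup pays the full binary-search cost; each subsequent level is searched within the narrow range the bridge provides.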

Logarithmic Runs

An FD-Tree combines fractional cascading with creating logarithmically sized sorted runs: immutable sorted arrays with sizes increasing by a factor of k, created by merging the previous level with the current one.

The highest-level run is created when the head tree becomes full: its leaf contents are written to the first level. As soon as the head tree fills up again, its contents are merged with the first-level items. The merged result replaces the old version of the first run. The lower-level runs are created when the sizes of the higher-level ones reach a threshold. If a lower-level run already exists, it is replaced by the result of merging its contents with the contents of a higher level. This process is quite similar to compaction in LSM Trees, where immutable table contents are merged to create larger tables.
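
The flush-and-merge cycle that produces logarithmic runs can be sketched roughly as below. This is a toy model, not the FD-Tree algorithm itself: the head "tree" is a plain dict, the capacities and growth factor `k` are illustrative, and fences/bridges are omitted.

```python
# Illustrative capacities: the head holds up to HEAD_CAPACITY records, and
# level i holds up to HEAD_CAPACITY * K**(i + 1) records.
K = 4
HEAD_CAPACITY = 2

head = {}   # stands in for the small mutable head tree
runs = []   # runs[i] is an immutable sorted list of (key, value) pairs

def merge_sorted(newer, older):
    # Entries from the newer run shadow entries from the older one.
    merged = dict(older)
    merged.update(newer)
    return sorted(merged.items())

def insert(key, value):
    head[key] = value
    if len(head) < HEAD_CAPACITY:
        return
    run = sorted(head.items())  # flush the head tree as a sorted run
    head.clear()
    level = 0
    while True:
        if level == len(runs):
            runs.append(run)
            return
        run = merge_sorted(run, runs[level])
        if len(run) <= HEAD_CAPACITY * K ** (level + 1):
            runs[level] = run
            return
        runs[level] = []  # contents continue downward inside `run`
        level += 1

def lookup(key):
    if key in head:
        return head[key]
    for run in runs:  # upper (newer) levels first
        for k, v in run:
            if k == key:
                return v
    return None
```

Random writes are confined to `head`; everything below it is rewritten only wholesale, during merges.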

Figure 6-6 shows a schematic representation of an FD-Tree, with a head B-Tree on the top, two logarithmic runs L1 and L2, and bridges between them.

图 6-6. Schematic FD-Tree overview (Figure 6-6. Schematic FD-Tree overview)

To keep items in all sorted runs addressable, FD-Trees use an adapted version of fractional cascading, where head elements from lower-level pages are propagated as pointers to the higher levels. Using these pointers, the cost of searching in lower-level trees is reduced, since the search was already partially done on a higher level and can continue from the closest match.

Since FD-Trees do not update pages in place, and it may happen that data records for the same key are present on several levels, deletes in FD-Trees work by inserting tombstones (the FD-Tree paper calls them filter entries), which indicate that the data record associated with the corresponding key is marked for deletion and that all data records for that key on the lower levels have to be discarded. When tombstones propagate all the way to the lowest level, they can be discarded, since it is guaranteed that there are no items they can shadow anymore.
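
A merge step that honors tombstones can be sketched as follows; representing a tombstone as a `None` value is a convention invented for this example.

```python
def merge_runs(newer, older, lowest_level):
    """Merge two sorted runs of (key, value) pairs. A value of None is a
    tombstone: it shadows older records for the same key, and is dropped
    once the merge output lands on the lowest level."""
    merged = dict(older)
    merged.update(newer)  # newer entries shadow older ones
    if lowest_level:
        merged = {k: v for k, v in merged.items() if v is not None}
    return sorted(merged.items())
```

At intermediate levels the tombstone must survive the merge, since still-lower levels may hold records it has to shadow.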

Bw-Trees

Write amplification is one of the most significant problems with in-place update implementations of B-Trees: subsequent updates to a B-Tree page may require updating a disk-resident page copy on every update. The second problem is space amplification: we reserve extra space to make updates possible. This also means that for each transferred useful byte carrying the requested data, we have to transfer some empty bytes and the rest of the page. The third problem is complexity in solving concurrency problems and dealing with latches.

To solve all three problems at once, we have to take an approach entirely different from the ones we’ve discussed so far. Buffering updates helps with write and space amplification, but offers no solution to concurrency issues.

We can batch updates to different nodes by using append-only storage, link nodes together into chains, and use an in-memory data structure that allows installing pointers between the nodes with a single compare-and-swap operation, making the tree lock-free. This approach is called a Buzzword-Tree (Bw-Tree) [LEVANDOSKI14].

Update Chains

A Bw-Tree writes a base node separately from its modifications. Modifications (delta nodes) form a chain: a linked list from the newest modification, through the older ones, with the base node at the end. Each update can be stored separately, without needing to rewrite the existing node on disk. Delta nodes can represent inserts, updates (which are indistinguishable from inserts), or deletes.

Since the sizes of base and delta nodes are unlikely to be page aligned, it makes sense to store them contiguously, and because neither base nor delta nodes are modified during update (all modifications just prepend a node to the existing linked list), we do not need to reserve any extra space.

Having a node as a logical, rather than physical, entity is an interesting paradigm change: we do not need to pre-allocate space, require nodes to have a fixed size, or even keep them in contiguous memory segments. This certainly has a downside: during a read, all deltas have to be traversed and applied to the base node to reconstruct the actual node state. This is somewhat similar to what LA-Trees do (see “Lazy-Adaptive Tree”): keeping updates separate from the main structure and replaying them on read.
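
A read over such a delta chain can be sketched as below; the node and record shapes are illustrative, not taken from the Bw-Tree paper.

```python
BASE, INSERT, DELETE = "base", "insert", "delete"

def make_base(records):
    return {"kind": BASE, "records": dict(records)}

def prepend(chain, kind, key, value=None):
    # Prepending never touches existing nodes: the new delta simply
    # points at the current head of the chain.
    return {"kind": kind, "key": key, "value": value, "next": chain}

def read(chain, key):
    """Walk from the newest delta toward the base node; the first delta
    mentioning the key wins, otherwise fall through to the base."""
    node = chain
    while node["kind"] != BASE:
        if node["key"] == key:
            return None if node["kind"] == DELETE else node["value"]
        node = node["next"]
    return node["records"].get(key)
```

The traversal cost grows with chain length, which is why consolidation (discussed later in this chapter) keeps chains short.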

Taming Concurrency with Compare-and-Swap

It would be quite costly to maintain an on-disk tree structure that allows prepending items to child nodes: it would require us to constantly update parent nodes with pointers to the freshest delta. This is why Bw-Tree nodes, consisting of a chain of deltas and the base node, have logical identifiers and use an in-memory mapping table from the identifiers to their locations on disk. Using this mapping also helps us to get rid of latches: instead of having exclusive ownership during write time, the Bw-Tree uses compare-and-swap operations on physical offsets in the mapping table.

Figure 6-7 shows a simple Bw-Tree. Each logical node consists of a single base node and multiple linked delta nodes.

图 6-7. Bw-Tree (Figure 6-7. Bw-Tree. Dashed lines represent virtual links between the nodes, resolved using the mapping table. Solid lines represent actual data pointers between the nodes.)

To update a Bw-Tree node, the algorithm executes the following steps:

  1. The target logical leaf node is located by traversing the tree from root to leaf. The mapping table contains virtual links to target base nodes or the latest delta nodes in the update chain.

  2. A new delta node is created with a pointer to the base node (or to the latest delta node) located during step 1.

  3. The mapping table is updated with a pointer to the new delta node created during step 2.

An update operation during step 3 can be done using compare-and-swap, which is an atomic operation, so all reads, concurrent to the pointer update, are ordered either before or after the write, without blocking either the readers or the writer. Reads ordered before follow the old pointer and do not see the new delta node, since it was not yet installed. Reads ordered after follow the new pointer, and observe the update. If two threads attempt to install a new delta node to the same logical node, only one of them can succeed, and the other one has to retry the operation.
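
The install-and-retry protocol can be sketched as follows. Python does not expose a hardware compare-and-swap, so the mapping table below simulates one with a lock; in a native implementation this would be a single atomic instruction on the mapping-table slot, and all names here are illustrative.

```python
import threading

class MappingTable:
    """Maps logical node IDs to the head of their delta chain."""
    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()  # stands in for an atomic CAS

    def get(self, node_id):
        return self._slots.get(node_id)

    def compare_and_swap(self, node_id, expected, new):
        with self._lock:
            if self._slots.get(node_id) is not expected:
                return False  # someone else won the race
            self._slots[node_id] = new
            return True

def install_delta(table, node_id, make_delta):
    """Retry until our delta is installed atop the freshest chain head."""
    while True:
        current = table.get(node_id)
        delta = make_delta(current)  # new delta points at the current head
        if table.compare_and_swap(node_id, current, delta):
            return delta
```

A failed CAS means another writer prepended a delta first; the retry rebuilds the delta against the new chain head, so no update is ever lost or overwritten.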

Structural Modification Operations

A Bw-Tree is logically structured like a B-Tree, which means that nodes still might grow to be too large (overflow) or shrink to be almost empty (underflow) and require structure modification operations (SMOs), such as splits and merges. The semantics of splits and merges here are similar to those of B-Trees (see “B-Tree Node Splits” and “B-Tree Node Merges”), but their implementation is different.

Split SMOs start by consolidating the logical contents of the splitting node, applying deltas to its base node, and creating a new page containing elements to the right of the split point. After this, the process proceeds in two steps [WANG18]:

  1. Split—A special split delta node is appended to the splitting node to notify the readers about the ongoing split. The split delta node holds a midpoint separator key to invalidate records in the splitting node, and a link to the new logical sibling node.

  2. Parent update—At this point, the situation is similar to that of the Blink-Tree half-split (see “Blink-Trees”), since the node is available through the split delta node pointer, but is not yet referenced by the parent, and readers have to go through the old node and then traverse the sibling pointer to reach the newly created sibling node. A new node is added as a child to the parent node, so that readers can directly reach it instead of being redirected through the splitting node, and the split completes.

Updating the parent pointer is a performance optimization: all nodes and their elements remain accessible even if the parent pointer is never updated. Bw-Trees are latch-free, so any thread can encounter an incomplete SMO. The thread is required to cooperate by picking up and finishing a multistep SMO before proceeding. The next thread will follow the installed parent pointer and won’t have to go through the sibling pointer.

Merge SMOs work in a similar way:

  1. Remove sibling—A special remove delta node is created and appended to the right sibling, indicating the start of the merge SMO and marking the right sibling for deletion.

  2. Merge—A merge delta node is created on the left sibling, pointing to the contents of the right sibling and making them a logical part of the left sibling.

  3. Parent update—At that point, the right sibling node contents are accessible from the left one. To finish the merge process, the link to the right sibling has to be removed from the parent.

Concurrent SMOs require an additional abort delta node to be installed on the parent to prevent concurrent splits and merges [WANG18]. An abort delta works similarly to a write lock: only one thread can have write access at a time, and any thread that attempts to append a new record to this delta node will abort. On SMO completion, the abort delta can be removed from the parent.

The Bw-Tree height grows during the root node splits. When the root node gets too big, it is split in two, and a new root is created in place of the old one, with the old root and a newly created sibling as its children.

Consolidation and Garbage Collection

Delta chains can get arbitrarily long without any additional action. Since reads are getting more expensive as the delta chain gets longer, we need to try to keep the delta chain length within reasonable bounds. When it reaches a configurable threshold, we rebuild the node by merging the base node contents with all of the deltas, consolidating them to one new base node. The new node is then written to the new location on disk and the node pointer in the mapping table is updated to point to it. We discuss this process in more detail in “LLAMA and Mindful Stacking”, as the underlying log-structured storage is responsible for garbage collection, node consolidation, and relocation.

As soon as the node is consolidated, its old contents (the base node and all of the delta nodes) are no longer addressed from the mapping table. However, we cannot free the memory they occupy right away, because some of them might be still used by ongoing operations. Since there are no latches held by readers (readers did not have to pass through or register at any sort of barrier to access the node), we need to find other means to track live pages.

To separate threads that might have encountered a specific node from those that couldn’t have possibly seen it, Bw-Trees use a technique known as epoch-based reclamation. If some nodes and deltas are removed from the mapping table due to consolidations that replaced them during some epoch, original nodes are preserved until every reader that started during the same epoch or the earlier one is finished. After that, they can be safely garbage collected, since later readers are guaranteed to have never seen those nodes, as they were not addressable by the time those readers started.
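
A simplified, single-threaded model of epoch-based reclamation might look like this; real implementations track epochs without a central structure and make reader registration far cheaper, so treat the shape below as purely illustrative.

```python
class EpochManager:
    def __init__(self):
        self.epoch = 0
        self.active = {}   # reader id -> epoch in which it started
        self.retired = []  # (epoch, node) pairs awaiting reclamation

    def enter(self, reader_id):
        self.active[reader_id] = self.epoch

    def exit(self, reader_id):
        del self.active[reader_id]

    def retire(self, node):
        # The node is unlinked from the mapping table, but its memory
        # cannot be reused yet: in-flight readers may still hold it.
        self.retired.append((self.epoch, node))
        self.epoch += 1

    def collect(self):
        """Free nodes whose retirement epoch precedes every active reader."""
        oldest = min(self.active.values(), default=self.epoch)
        freed = [n for e, n in self.retired if e < oldest]
        self.retired = [(e, n) for e, n in self.retired if e >= oldest]
        return freed
```

A node retired in epoch `e` is reclaimed only once no reader that entered in epoch `e` or earlier remains active, which is exactly the guarantee the text describes.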

The Bw-Tree is an interesting B-Tree variant, making improvements on several important aspects: write amplification, nonblocking access, and cache friendliness. A modified version was implemented in Sled, an experimental storage engine. The CMU Database Group has developed an in-memory version of the Bw-Tree called OpenBw-Tree and released a practical implementation guide [WANG18].

We’ve only touched on higher-level Bw-Tree concepts related to B-Trees in this chapter, and we continue the discussion about them in “LLAMA and Mindful Stacking”, including the discussion about the underlying log-structured storage.

Cache-Oblivious B-Trees

Block size, node size, cache line alignments, and other configurable parameters influence B-Tree performance. A new class of data structures called cache-oblivious structures [DEMAINE02] give asymptotically optimal performance regardless of the underlying memory hierarchy and a need to tune these parameters. This means that the algorithm is not required to know the sizes of the cache lines, filesystem blocks, and disk pages. Cache-oblivious structures are designed to perform well without modification on multiple machines with different configurations.

So far, we’ve been mostly looking at B-Trees from a two-level memory hierarchy (with the exception of LMDB described in “Copy-on-Write”). B-Tree nodes are stored in disk-resident pages, and the page cache is used to allow efficient access to them in main memory.

The two levels of this hierarchy are page cache (which is faster, but is limited in space) and disk (which is generally slower, but has a larger capacity) [AGGARWAL88]. Here, we have only two parameters, which makes it rather easy to design algorithms as we only have to have two level-specific code modules that take care of all the details relevant to that level.

The disk is partitioned into blocks, and data is transferred between disk and cache in blocks: even when the algorithm has to locate a single item within the block, an entire block has to be loaded. This approach is cache-aware.

When developing performance-critical software, we often program for a more complex model, taking into consideration CPU caches, and sometimes even disk hierarchies (like hot/cold storage, or building HDD/SSD/NVM hierarchies and phasing data from one level to the other). Most of the time, such efforts are hard to generalize. In “Memory- Versus Disk-Based DBMS”, we talked about the fact that accessing disk is several orders of magnitude slower than accessing main memory, which has motivated database implementers to optimize for this difference.

Cache-oblivious algorithms allow reasoning about data structures in terms of a two-level memory model while providing the benefits of a multilevel hierarchy model. This approach allows having no platform-specific parameters, yet guarantees that the number of transfers between the two levels of the hierarchy is within a constant factor. If the data structure is optimized to perform optimally for any two levels of memory hierarchy, it also works optimally for the two adjacent hierarchy levels. This is achieved by working at the highest cache level as much as possible.

van Emde Boas Layout

A cache-oblivious B-Tree consists of a static B-Tree and a packed array structure [BENDER05]. A static B-Tree is built using the van Emde Boas layout. It splits the tree at the middle level of its edges. Then each subtree is split recursively in a similar manner, resulting in subtrees of √N size. The key idea of this layout is that any recursive subtree is stored in a contiguous block of memory.
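
The recursive split can be sketched as follows: given a complete binary tree of a given height, with nodes numbered in breadth-first order, the function returns the order in which nodes would be laid out in a contiguous memory block. This is an illustrative sketch, assuming the top part takes ⌊height/2⌋ levels; real implementations pick the split point differently (e.g., hyperceiling).

```python
def veb_layout(height):
    """Return BFS node indices of a complete binary tree with `height`
    levels, in van Emde Boas (memory) order."""
    def recurse(root, height):
        if height == 1:
            return [root]
        top_h = height // 2
        # Lay out the top subtree first, as one contiguous block.
        order = recurse(root, top_h)
        # The bottom subtrees are rooted at the children of the top
        # subtree's leaves, left to right.
        leaves = [root]
        for _ in range(top_h - 1):
            leaves = [c for n in leaves for c in (2 * n + 1, 2 * n + 2)]
        for leaf in leaves:
            for child in (2 * leaf + 1, 2 * leaf + 2):
                order += recurse(child, height - top_h)
        return order
    return recurse(0, height)
```

For a three-level tree, the root is followed by its left subtree as one block, then its right subtree, so every recursive subtree occupies contiguous positions.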

In Figure 6-8, you can see an example of a van Emde Boas layout. Nodes, logically grouped together, are placed closely together. On top, you can see a logical layout representation (i.e., how nodes form a tree), and on the bottom you can see how tree nodes are laid out in memory and on disk.

图 6-8. van Emde Boas layout (Figure 6-8. van Emde Boas layout)

To make the data structure dynamic (i.e., allow inserts, updates, and deletes), cache-oblivious trees use a packed array data structure, which uses contiguous memory segments for storing elements, but contains gaps reserved for future inserted elements. Gaps are spaced based on the density threshold. Figure 6-9 shows a packed array structure, where elements are spaced to create gaps.

图 6-9. Packed array (Figure 6-9. Packed array)

This approach allows inserting items into the tree with fewer relocations. Items have to be relocated just to create a gap for the newly inserted element, if the gap is not already present. When the packed array becomes too densely or sparsely populated, the structure has to be rebuilt to grow or shrink the array.
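
Insertion into a packed array can be sketched as below: `None` marks a gap, and an insert shifts elements only up to the nearest gap to the right. Density-threshold tracking and rebuilds are omitted from this toy.

```python
def pa_insert(arr, value):
    """Insert value into a packed array (None marks a gap), keeping the
    non-None elements in sorted order."""
    # Position right after the last element smaller than the value.
    last_smaller = -1
    for i, x in enumerate(arr):
        if x is not None and x < value:
            last_smaller = i
    pos = last_smaller + 1
    # Find the nearest gap at or after the insertion point.
    gap = next((i for i in range(pos, len(arr)) if arr[i] is None), None)
    if gap is None:
        raise OverflowError("no gap to the right; rebuild required")
    # Shift the block [pos, gap) right by one slot into the gap,
    # then place the new value.
    arr[pos + 1 : gap + 1] = arr[pos:gap]
    arr[pos] = value
```

Only the elements between the insertion point and the nearest gap move, instead of everything to the right, which is the point of reserving gaps.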

The static tree is used as an index for the bottom-level packed array, and has to be updated in accordance with relocated elements to point to correct elements on the bottom level.

This is an interesting approach, and ideas from it can be used to build efficient B-Tree implementations. It allows constructing on-disk structures in ways that are very similar to how main memory ones are constructed. However, as of the date of writing, I’m not aware of any nonacademic cache-oblivious B-Tree implementations.

A possible reason for that is an assumption that when cache loading is abstracted away, while data is loaded and written back in blocks, paging and eviction still have a negative impact on the result. Another possible reason is that in terms of block transfers, the complexity of cache-oblivious B-Trees is the same as their cache-aware counterpart. This may change when more efficient nonvolatile byte-addressable storage devices become more widespread.

Summary

The original B-Tree design has several shortcomings that might have worked well on spinning disks, but make it less efficient when used on SSDs. B-Trees have high write amplification (caused by page rewrites) and high space overhead since B-Trees have to reserve space in nodes for future writes.

Write amplification can be reduced by using buffering. Lazy B-Trees, such as WiredTiger and LA-Trees, attach in-memory buffers to individual nodes or groups of nodes to reduce the number of required I/O operations by buffering subsequent updates to pages in memory.

To reduce space amplification, FD-Trees use immutability: data records are stored in the immutable sorted runs, and the size of a mutable B-Tree is limited.

Bw-Trees solve space amplification by using immutability, too. B-Tree nodes and updates to them are stored in separate on-disk locations and persisted in the log-structured store. Write amplification is reduced compared to the original B-Tree design, since reconciling contents that belong to a single logical node is relatively infrequent. Bw-Trees do not require latches for protecting pages from concurrent accesses, as the virtual pointers between the logical nodes are stored in memory.

1 To learn more about LMDB, see the code comments and the presentation.

2 This is not a commonly recognized name, but since the B-Tree variants we’re discussing here share one property—buffering B-Tree updates in intermediate structures instead of applying them to the tree directly—we’ll use the term lazy, which rather precisely defines this property.

Chapter 7. Log-Structured Storage

Accountants don’t use erasers or they end up in jail.

Pat Helland

When accountants have to modify the record, instead of erasing the existing value, they create a new record with a correction. When the quarterly report is published, it may contain minor modifications, correcting the previous quarter results. To derive the bottom line, you have to go through the records and calculate a subtotal [HELLAND15].

Similarly, immutable storage structures do not allow modifications to the existing files: tables are written once and are never modified again. Instead, new records are appended to the new file and, to find the final value (or conclude its absence), records have to be reconstructed from multiple files. In contrast, mutable storage structures modify records on disk in place.

Immutable data structures are often used in functional programming languages and are getting more popular because of their safety characteristics: once created, an immutable structure doesn’t change, all of its references can be accessed concurrently, and its integrity is guaranteed by the fact that it cannot be modified.

On a high level, there is a strict distinction between how data is treated inside a storage structure and outside of it. Internally, immutable files can hold multiple copies, more recent ones overwriting the older ones, while mutable files generally hold only the most recent value instead. When accessed, immutable files are processed, redundant copies are reconciled, and the most recent ones are returned to the client.
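
The append-and-reconcile model can be illustrated with a toy log; the record shape (a sequence number, key, and value) is made up for the example.

```python
log = []  # an append-only list of (sequence, key, value) records

def append(key, value):
    # Corrections are appended, never written over earlier records,
    # just like the accountant's ledger in the epigraph.
    log.append((len(log), key, value))

def read(key):
    """Reconcile duplicates by returning the most recent value."""
    latest = None
    for seq, k, v in log:
        if k == key:
            latest = v
    return latest
```

Internally the log keeps every version; externally, a read only ever surfaces the reconciled, most recent one.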

As do other books and papers on the subject, we use B-Trees as a typical example of a mutable structure and Log-Structured Merge Trees (LSM Trees) as an example of an immutable structure. Immutable LSM Trees use append-only storage and merge reconciliation, while B-Trees locate data records on disk and update pages at their original offsets in the file.

In-place update storage structures are optimized for read performance [GRAEFE04]: after locating data on disk, the record can be returned to the client. This comes at the expense of write performance: to update the data record in place, it first has to be located on disk. On the other hand, append-only storage is optimized for write performance. Writes do not have to locate records on disk to overwrite them. However, this is done at the expense of reads, which have to retrieve multiple data record versions and reconcile them.

So far we’ve mostly talked about mutable storage structures. We’ve touched on the subject of immutability while discussing copy-on-write B-Trees (see “Copy-on-Write”), FD-Trees (see “FD-Trees”), and Bw-Trees (see “Bw-Trees”). But there are more ways to implement immutable structures.

由于可变 B 树采用的结构和构造方法,读、写和维护期间的大多数 I/O 操作都是随机的。每次写操作首先需要找到保存数据记录的页,然后才能对其进行修改。B 树需要节点拆分和合并,以重新定位已写入的记录。一段时间后,B-Tree 页面可能需要维护。页面的大小是固定的,并且保留一些可用空间以供将来写入。另一个问题是,即使仅修改页面中的一个单元,也必须重写整个页面。

Because of the structure and construction approach taken by mutable B-Trees, most I/O operations during reads, writes, and maintenance are random. Each write operation first needs to locate a page that holds a data record and only then can modify it. B-Trees require node splits and merges that relocate already written records. After some time, B-Tree pages may require maintenance. Pages are fixed in size, and some free space is reserved for future writes. Another problem is that even when only one cell in the page is modified, an entire page has to be rewritten.

有一些替代方法可以帮助缓解这些问题,使某些 I/O 操作按顺序进行,并避免在修改期间重写页面。实现此目的的方法之一是使用不可变结构。在本章中,我们将重点介绍 LSM 树:它们是如何构建的、它们的属性是什么以及它们与 B 树有何不同。

There are alternative approaches that can help to mitigate these problems, make some of the I/O operations sequential, and avoid page rewrites during modifications. One of the ways to do this is to use immutable structures. In this chapter, we’ll focus on LSM Trees: how they’re built, what their properties are, and how they are different from B-Trees.

LSM树

LSM Trees

在讨论 B 树时，我们得出的结论是：可以通过使用缓冲来改善空间开销和写入放大。一般来说，缓冲可以通过两种方式应用于不同的存储结构：推迟将写入传播到磁盘驻留页面（正如我们在“FD 树”和“WiredTiger”中看到的那样），以及使写入操作顺序化。

When talking about B-Trees, we concluded that space overhead and write amplification can be improved by using buffering. Generally, there are two ways buffering can be applied in different storage structures: to postpone propagating writes to disk-resident pages (as we’ve seen with “FD-Trees” and “WiredTiger”), and to make write operations sequential.

LSM Tree 是最流行的不可变磁盘存储结构之一,它使用缓冲和仅附加存储以实现顺序写入。LSM 树是类似于 B 树的磁盘驻留结构的变体,其中节点被完全占用,针对顺序磁盘访问进行了优化。这个概念首先由 Patrick O'Neil 和 Edward Cheng 在一篇论文中引入[ONEIL96]。日志结构合并树的名字来源于日志结构文件系统,它将所有修改写入磁盘上的类似日志的文件中[ROSENBLUM92]

One of the most popular immutable on-disk storage structures, LSM Tree uses buffering and append-only storage to achieve sequential writes. The LSM Tree is a variant of a disk-resident structure similar to a B-Tree, where nodes are fully occupied, optimized for sequential disk access. This concept was first introduced in a paper by Patrick O’Neil and Edward Cheng [ONEIL96]. Log-structured merge trees take their name from log-structured filesystems, which write all modifications on disk in a log-like file [ROSENBLUM92].

注意

LSM 树写入不可变的文件并随着时间的推移将它们合并在一起。这些文件通常包含自己的索引,以帮助读者有效地定位数据。尽管 LSM 树经常作为 B 树的替代品出现,但 B 树通常用作 LSM 树不可变文件的内部索引结构。

LSM Trees write immutable files and merge them together over time. These files usually contain an index of their own to help readers efficiently locate data. Even though LSM Trees are often presented as an alternative to B-Trees, it is common for B-Trees to be used as the internal indexing structure for an LSM Tree’s immutable files.

LSM 树中的“合并”一词表示,由于其不变性,树内容使用类似于合并排序的方法进行合并。这种情况发生在维护期间,以回收冗余副本占用的空间,以及在读取期间,在内容返回给用户之前。

The word “merge” in LSM Trees indicates that, due to their immutability, tree contents are merged using an approach similar to merge sort. This happens during maintenance to reclaim space occupied by the redundant copies, and during reads, before contents can be returned to the user.

LSM 树推迟数据文件的写入，并将更改缓冲在内存驻留表中。然后，通过将这些更改的内容写出到不可变的磁盘文件来传播它们。在文件完全持久化之前，所有数据记录仍可在内存中访问。

LSM Trees defer data file writes and buffer changes in a memory-resident table. These changes are then propagated by writing their contents out to the immutable disk files. All data records remain accessible in memory until the files are fully persisted.

保持数据文件不可变有利于顺序写入:数据一次性写入磁盘,文件只能追加。可变结构可以在单遍中预分配块(例如,索引顺序访问方法 (ISAM) [RAMAKRISHNAN03] [LARSON81]),但后续访问仍然需要随机读取和写入。不可变结构允许我们按顺序布置数据记录以防止碎片。此外,不可变文件具有更高的 密度:我们不会为稍后写入的数据记录或更新的记录需要比原始写入的记录更多的空间的情况保留任何额外的空间。

Keeping data files immutable favors sequential writes: data is written on the disk in a single pass and files are append-only. Mutable structures can pre-allocate blocks in a single pass (for example, indexed sequential access method (ISAM) [RAMAKRISHNAN03] [LARSON81]), but subsequent accesses still require random reads and writes. Immutable structures allow us to lay out data records sequentially to prevent fragmentation. Additionally, immutable files have higher density: we do not reserve any extra space for data records that are going to be written later, or for the cases when updated records require more space than the originally written ones.

由于文件是不可变的,插入、更新和删除操作不需要在磁盘上定位数据记录,这显着提高了写入性能和吞吐量。相反,允许重复内容,并在读取期间解决冲突。LSM 树对于写入比读取更为常见的应用程序特别有用,鉴于数据量和摄取率不断增长,现代数据密集型系统中经常出现这种情况。

Since files are immutable, insert, update, and delete operations do not need to locate data records on disk, which significantly improves write performance and throughput. Instead, duplicate contents are allowed, and conflicts are resolved during the read time. LSM Trees are particularly useful for applications where writes are far more common than reads, which is often the case in modern data-intensive systems, given ever-growing amounts of data and ingest rates.

按照设计,读写不交叉,因此磁盘上的数据无需段锁定即可读写,从而显着简化了并发访问。相比之下,可变结构采用分层锁和闩锁(您可以在“并发控制”中找到有关锁和闩锁的更多信息)来确保磁盘数据结构的完整性,并允许多个并发读取器,但需要写入器独占子树所有权。基于 LSM 的存储引擎使用数据和索引文件的线性化内存视图,并且只需保护对管理它们的结构的并发访问。

Reads and writes do not intersect by design, so data on disk can be read and written without segment locking, which significantly simplifies concurrent access. In contrast, mutable structures employ hierarchical locks and latches (you can find more information about locks and latches in “Concurrency Control”) to ensure on-disk data structure integrity, and allow multiple concurrent readers but require exclusive subtree ownership for writers. LSM-based storage engines use linearizable in-memory views of data and index files, and only have to guard concurrent access to the structures managing them.

B 树和 LSM 树都需要一些内务处理来优化性能,但原因不同。由于分配的文件数量稳步增长,LSM Tree 必须合并和重写文件,以确保在读取过程中访问尽可能少的文件数量,因为请求的数据记录可能分布在多个文件中。另一方面,可变文件可能必须部分或全部重写,以减少碎片并回收更新或删除的记录占用的空间。当然,内务处理的具体工作范围在很大程度上取决于具体的实施。

Both B-Trees and LSM Trees require some housekeeping to optimize performance, but for different reasons. Since the number of allocated files steadily grows, LSM Trees have to merge and rewrite files to make sure that the smallest possible number of files is accessed during the read, as requested data records might be spread across multiple files. On the other hand, mutable files may have to be rewritten partially or wholly to decrease fragmentation and reclaim space occupied by updated or deleted records. Of course, the exact scope of work done by the housekeeping process heavily depends on the concrete implementation.

LSM树结构

LSM Tree Structure

我们从有序的 LSM 树[ONEIL96]开始,其中文件保存排序的数据记录。稍后,在“无序LSM存储”中,我们还将讨论允许按插入顺序存储数据记录的结构,这在写入路径上具有一些明显的优势。

We start with ordered LSM Trees [ONEIL96], where files hold sorted data records. Later, in “Unordered LSM Storage”, we’ll also discuss structures that allow storing data records in insertion order, which has some obvious advantages on the write path.

正如我们刚刚讨论的,LSM 树由较小的内存驻留组件和较大的磁盘驻留组件组成。要将不可变的文件内容写到磁盘上,需要首先将它们缓冲在内存中并对其内容进行排序。

As we just discussed, LSM Trees consist of smaller memory-resident and larger disk-resident components. To write out immutable file contents on disk, it is necessary to first buffer them in memory and sort their contents.

内存驻留组件（通常称为 memtable）是可变的：它缓冲数据记录并充当读写操作的目标。当 memtable 的大小增长到可配置的阈值时，其内容将被持久化到磁盘上。memtable 更新不会产生磁盘访问，也没有相关的 I/O 成本。需要一个单独的预写日志文件（类似于我们在“恢复”中讨论的内容）来保证数据记录的持久性。在向客户端确认操作之前，数据记录会先附加到日志中并提交到内存中。

A memory-resident component (often called a memtable) is mutable: it buffers data records and serves as a target for read and write operations. Memtable contents are persisted on disk when its size grows up to a configurable threshold. Memtable updates incur no disk access and have no associated I/O costs. A separate write-ahead log file, similar to what we discussed in “Recovery”, is required to guarantee durability of data records. Data records are appended to the log and committed in memory before the operation is acknowledged to the client.
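The ordering described above — append to the log first, apply to the memtable, only then acknowledge — can be sketched as follows. This is an illustrative sketch, not any particular engine's API; all class and method names (`WriteAheadLog`, `MemTable`, `WritePath`) are hypothetical, and the log is stood in by a Python list.

```python
# Sketch of the durability rule: log append -> memtable update -> ack.
# Names are illustrative; a real WAL would append to a file and fsync.

class WriteAheadLog:
    def __init__(self):
        self.entries = []              # stand-in for an append-only file

    def append(self, key, value):
        self.entries.append((key, value))   # fsync would happen here


class MemTable:
    def __init__(self):
        self.records = {}              # real engines use a sorted structure

    def put(self, key, value):
        self.records[key] = value


class WritePath:
    def __init__(self):
        self.wal = WriteAheadLog()
        self.memtable = MemTable()

    def write(self, key, value):
        self.wal.append(key, value)    # 1. durability first
        self.memtable.put(key, value)  # 2. make the write visible in memory
        return "ack"                   # 3. acknowledge to the client
```

After a crash, replaying `wal.entries` in order would rebuild the memtable contents that had not yet been flushed.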

缓冲在内存中完成:所有读写操作都应用于内存驻留表,该表维护允许并发访问的排序数据结构,通常是某种形式的内存排序树,或可以提供类似性能特征的任何数据结构。

Buffering is done in memory: all read and write operations are applied to a memory-resident table that maintains a sorted data structure allowing concurrent access, usually some form of an in-memory sorted tree, or any data structure that can give similar performance characteristics.

磁盘驻留组件是通过将内存中缓冲的内容刷新到磁盘来构建的。磁盘驻留组件仅用于读取：缓冲的内容被持久化后，文件永远不会被修改。这使我们能够用简单的操作来思考：针对内存表的写入，以及针对磁盘表和内存表的读取、合并和文件删除。

Disk-resident components are built by flushing contents buffered in memory to disk. Disk-resident components are used only for reads: buffered contents are persisted, and files are never modified. This allows us to think in terms of simple operations: writes against an in-memory table, and reads against disk and memory-based tables, merges, and file removals.

在本章中，我们将使用“表”一词作为磁盘驻留表的简称。由于我们讨论的是存储引擎的语义，因此该术语不会与更广泛的数据库管理系统上下文中的表概念产生歧义。

Throughout this chapter, we will be using the word table as a shortcut for disk-resident table. Since we’re discussing semantics of a storage engine, this term is not ambiguous with a table concept in the wider context of a database management system.

双组件 LSM 树

Two-component LSM Tree

我们区分双组件和多组件 LSM 树。双组件 LSM 树只有一个磁盘组件，由不可变的段组成。这里的磁盘组件被组织为 B 树，节点占用率为 100%，页面只读。

We distinguish between two- and multicomponent LSM Trees. Two-component LSM Trees have only one disk component, comprised of immutable segments. The disk component here is organized as a B-Tree, with 100% node occupancy and read-only pages.

内存驻留树内容部分刷新到磁盘上。在刷新期间,对于每个刷新的内存子树,我们在磁盘上找到对应的子树,并将内存驻留段和磁盘驻留子树的合并内容写入磁盘上的新段。图 7-1显示了合并之前内存中和磁盘驻留的树。

Memory-resident tree contents are flushed on disk in parts. During a flush, for each flushed in-memory subtree, we find a corresponding subtree on disk and write out the merged contents of a memory-resident segment and disk-resident subtree into the new segment on disk. Figure 7-1 shows in-memory and disk-resident trees before a merge.

数据库0701
图 7-1。刷新之前的双组件 LSM 树。正在刷新的内存段和磁盘驻留段以灰色显示。

刷新子树后,被取代的内存驻留子树和磁盘驻留子树将被丢弃并替换为它们的合并结果,该结果可从磁盘驻留树的预先存在的部分进行寻址。图 7-2显示了合并过程的结果,已写入磁盘上的新位置并附加到树的其余部分。

After the subtree is flushed, superseded memory-resident and disk-resident subtrees are discarded and replaced with the result of their merge, which becomes addressable from the preexisting sections of the disk-resident tree. Figure 7-2 shows the result of a merge process, already written to the new location on disk and attached to the rest of the tree.

数据库0702
图 7-2。刷新后的双组件 LSM 树。合并的内容以灰色显示。虚线框表示丢弃的磁盘段。

可以通过推进迭代器以锁步方式读取磁盘驻留叶节点和内存中树的内容来实现合并。由于两个源都已排序,为了生成排序的合并结果,我们只需要在合并过程的每个步骤中知道两个迭代器的当前值。

A merge can be implemented by advancing iterators reading the disk-resident leaf nodes and contents of the in-memory tree in lockstep. Since both sources are sorted, to produce a sorted merged result, we only need to know the current values of both iterators during each step of the merge process.

这种方法是我们关于不可变 B 树的对话的逻辑扩展和延续。写时复制B-Tree(参见“写时复制”)使用B-Tree结构,但它们的节点没有被完全占用,并且需要复制根叶路径上的页面并创建并行树结构。在这里,我们做了类似的事情,但是由于我们在内存中缓冲写入,因此我们分摊了磁盘驻留树更新的成本。

This approach is a logical extension and continuation of our conversation on immutable B-Trees. Copy-on-write B-Trees (see “Copy-on-Write”) use B-Tree structure, but their nodes are not fully occupied, and they require copying pages on the root-leaf path and creating a parallel tree structure. Here, we do something similar, but since we buffer writes in memory, we amortize the costs of the disk-resident tree update.

在实现子树合并和刷新时,我们必须确保三件事:

When implementing subtree merges and flushes, we have to make sure of three things:

  1. 一旦刷新过程开始，所有新的写入都必须进入新的内存表。

  1. As soon as the flush process starts, all new writes have to go to the new memtable.

  2. 在子树刷新期间，磁盘驻留子树和正在刷新的内存驻留子树都必须保持可读。

  2. During the subtree flush, both the disk-resident and flushing memory-resident subtree have to remain accessible for reads.

  3. 刷新后，发布合并内容以及丢弃未合并的磁盘和内存驻留内容必须以原子方式执行。

  3. After the flush, publishing merged contents, and discarding unmerged disk- and memory-resident contents have to be performed atomically.

尽管双组件 LSM 树对于维护索引文件很有用，但截至撰写本文时，作者尚不知道有任何实现。这可以用这种方法的写入放大特征来解释：合并相对频繁，因为它们由 memtable 刷新触发。

Even though two-component LSM Trees can be useful for maintaining index files, no implementations are known to the author as of time of writing. This can be explained by the write amplification characteristics of this approach: merges are relatively frequent, as they are triggered by memtable flushes.

多组件 LSM 树

Multicomponent LSM Trees

让我们考虑另一种设计：多组件 LSM 树，它拥有不止一个磁盘驻留表。在这种情况下，整个 memtable 的内容会在一次运行中刷新。

Let’s consider an alternative design, multicomponent LSM Trees that have more than just one disk-resident table. In this case, entire memtable contents are flushed in a single run.

很快就会发现,在多次刷新之后,我们最终会得到多个磁盘驻留表,并且它们的数量只会随着时间的推移而增长。由于我们并不总是确切地知道哪些表保存了所需的数据记录,因此我们可能必须访问多个文件才能找到搜索到的数据。

It quickly becomes evident that after multiple flushes we’ll end up with multiple disk-resident tables, and their number will only grow over time. Since we do not always know exactly which tables are holding required data records, we might have to access multiple files to locate the searched data.

必须从多个来源而不是仅从一个来源读取可能会变得昂贵。为了缓解这个问题并将表的数量保持在最低限度，系统会触发称为压缩（compaction）的定期合并过程（参见“LSM 树的维护”）。压缩选择若干个表，读取它们的内容，将它们合并，并将合并结果写入新的组合文件。旧表在新合并表出现的同时被丢弃。

Having to read from multiple sources instead of just one might get expensive. To mitigate this problem and keep the number of tables to minimum, a periodic merge process called compaction (see “Maintenance in LSM Trees”) is triggered. Compaction picks several tables, reads their contents, merges them, and writes the merged result out to the new combined file. Old tables are discarded simultaneously with the appearance of the new merged table.

图 7-3显示了多组件 LSM Tree 数据生命周期。数据首先缓冲在内存驻留组件中。当它变得太大时,其内容将刷新到磁盘上以创建磁盘驻留表。随后,多个表被合并在一起以创建更大的表。

Figure 7-3 shows the multicomponent LSM Tree data life cycle. Data is first buffered in a memory-resident component. When it gets too large, its contents are flushed on disk to create disk-resident tables. Later, multiple tables are merged together to create larger tables.

数据库0703
图 7-3。多组件 LSM Tree 数据生命周期

本章的其余部分专门讨论多组件 LSM 树、构建块及其维护过程。

The rest of this chapter is dedicated to multicomponent LSM Trees, building blocks, and their maintenance processes.

内存表

In-memory tables

memtable 刷新可以定期触发，也可以通过大小阈值触发。在刷新之前，必须切换 memtable：分配一个新的 memtable，它成为所有新写入的目标，而旧的则进入刷新状态。这两个步骤必须以原子方式执行。正在刷新的 memtable 在其内容完全刷新之前仍可供读取。此后，旧的 memtable 将被丢弃，取而代之的是新写入的磁盘驻留表，该表变得可供读取。

Memtable flushes can be triggered periodically, or by using a size threshold. Before it can be flushed, the memtable has to be switched: a new memtable is allocated, and it becomes a target for all new writes, while the old one moves to the flushing state. These two steps have to be performed atomically. The flushing memtable remains available for reads until its contents are fully flushed. After this, the old memtable is discarded in favor of a newly written disk-resident table, which becomes available for reads.
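The switch-and-flush protocol just described can be sketched as follows. This is a deliberately simplified, single-threaded illustration (real engines perform the switch atomically under concurrency); the class name `LSMStore` and the dict-based tables are assumptions for the sketch.

```python
# Sketch of memtable switching: once the current memtable reaches a size
# threshold, it moves to the "flushing" state and stays readable until its
# contents are written out as an immutable disk-resident table.

class LSMStore:
    def __init__(self, threshold=2):
        self.threshold = threshold
        self.current = {}          # active memtable, target of all writes
        self.flushing = None       # memtable being flushed, still readable
        self.disk_tables = []      # immutable tables, newest appended last

    def put(self, key, value):
        self.current[key] = value
        if len(self.current) >= self.threshold and self.flushing is None:
            # switch: new writes go to a fresh memtable
            self.flushing, self.current = self.current, {}

    def finish_flush(self):
        # persist sorted contents, then atomically discard the old memtable
        self.disk_tables.append(dict(sorted(self.flushing.items())))
        self.flushing = None

    def get(self, key):
        # newest sources first: current, flushing, then disk tables
        for source in [self.current, self.flushing or {},
                       *reversed(self.disk_tables)]:
            if key in source:
                return source[key]
        return None
```

Note that reads consult the current memtable, then the flushing one, then disk tables, which mirrors the rule that the flushing memtable remains available for reads until it is discarded.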

图 7-4中,您可以看到 LSM Tree 的组件、它们之间的关系以及实现它们之间转换的操作:

In Figure 7-4, you see the components of the LSM Tree, relationships between them, and operations that fulfill transitions between them:

当前内存表
Current memtable

接收写入并提供读取服务。

Receives writes and serves reads.

刷新内存表
Flushing memtable

可供读取。

Available for reads.

磁盘刷新目标
On-disk flush target

不参与读取,因为其内容不完整。

Does not participate in reads, as its contents are incomplete.

已刷新的表
Flushed tables

一旦刷新的内存表被丢弃,就可以读取。

Available for reads as soon as the flushed memtable is discarded.

正在压缩的表
Compacting tables

当前正在合并的磁盘驻留表。

Currently merging disk-resident tables.

已压缩的表
Compacted tables

由已刷新的表或其他已压缩的表创建。

Created from flushed or other compacted tables.

数据库0704
图 7-4。LSM组件结构

数据已经在内存中排序,因此可以通过将内存驻留内容顺序写入磁盘来创建磁盘驻留表。在刷新期间,刷新内存表和当前内存表都可供读取。

Data is already sorted in memory, so a disk-resident table can be created by sequentially writing out memory-resident contents to disk. During a flush, both the flushing memtable and the current memtable are available for read.

在 memtable 完全刷新之前，其内容唯一的磁盘驻留版本存储在预写日志中。当 memtable 的内容完全刷新到磁盘上后，即可修剪日志，并且可以丢弃保存着已应用于该 memtable 的操作的日志段。

Until the memtable is fully flushed, the only disk-resident version of its contents is stored in the write-ahead log. When memtable contents are fully flushed on disk, the log can be trimmed, and the log section, holding operations applied to the flushed memtable, can be discarded.

更新和删除

Updates and Deletes

在 LSM 树中，插入、更新和删除操作不需要在磁盘上定位数据记录。相反，冗余记录会在读取期间进行协调。

In LSM Trees, insert, update, and delete operations do not require locating data records on disk. Instead, redundant records are reconciled during the read.

从内存表中删除数据记录是不够的,因为其他磁盘或内存驻留表可能保存相同键的数据记录。如果我们仅通过从内存表中删除项目来实现删除,那么最终的删除要么没有影响,要么会恢复以前的值。

Removing data records from the memtable is not enough, since other disk or memory resident tables may hold data records for the same key. If we were to implement deletes by just removing items from the memtable, we would end up with deletes that either have no impact or would resurrect the previous values.

让我们考虑一个例子。已刷新的磁盘驻留表包含与键 k1 关联的数据记录 v1，而 memtable 保存着它的新值 v2：

Let’s consider an example. The flushed disk-resident table contains data record v1 associated with a key k1, and the memtable holds its new value v2:

磁盘表            内存表
| k1 | v1 |       | k1 | v2 |
Disk Table        Memtable
| k1 | v1 |       | k1 | v2 |

如果我们只是从 memtable 中删除 v2 并刷新它，我们实际上就复活了 v1，因为它成为与该键关联的唯一值：

If we just remove v2 from the memtable and flush it, we effectively resurrect v1, since it becomes the only value associated with that key:

磁盘表            内存表
| k1 | v1 |       ∅
Disk Table        Memtable
| k1 | v1 |       ∅

因此，删除需要被显式记录。这可以通过插入特殊的删除条目（有时称为墓碑（tombstone）或休眠证书（dormant certificate））来完成，以指示删除与特定键关联的数据记录：

Because of that, deletes need to be recorded explicitly. This can be done by inserting a special delete entry (sometimes called a tombstone or a dormant certificate), indicating removal of the data record associated with a specific key:

磁盘表            内存表
| k1 | v1 |       | k1 | <墓碑> |
Disk Table        Memtable
| k1 | v1 |       | k1 | <tombstone> |

协调过程会拾取逻辑删除,并过滤掉隐藏的值。

The reconciliation process picks up tombstones, and filters out the shadowed values.
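The tombstone-aware read path can be sketched as a small reconciliation function. This is an illustrative sketch under the assumption that sources are ordered newest-first; `TOMBSTONE` and `reconcile` are hypothetical names, not an actual engine's API.

```python
# Sketch of read-time reconciliation with tombstones: scanning sources
# newest-first, the first hit for a key wins, and a tombstone hides any
# older values for that key.

TOMBSTONE = object()   # illustrative sentinel representing a delete entry

def reconcile(key, sources_newest_first):
    """Return the live value for key, or None if it is deleted or absent."""
    for table in sources_newest_first:
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None
```

With the tables from the example above, a tombstone for `k1` in the memtable shadows `v1` in the disk table, so the read returns nothing instead of resurrecting the old value.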

有时，删除一个连续的键范围而不是单个键可能会很有用。这可以通过谓词删除来实现，其工作原理是附加一个删除条目，其中带有一个按常规记录排序规则排序的谓词。在协调期间，与谓词匹配的数据记录将被跳过，不会返回给客户端。

Sometimes it might be useful to remove a consecutive range of keys rather than just a single key. This can be done using predicate deletes, which work by appending a delete entry with a predicate that sorts according to regular record-sorting rules. During reconciliation, data records matching the predicate are skipped and not returned to the client.

谓词可以采用 DELETE FROM table WHERE key ≥ "k2" AND key < "k4" 的形式，并且可以接受任何范围匹配器。Apache Cassandra 实现了这种方法，并将其称为范围墓碑（range tombstone）。范围墓碑覆盖一个键范围，而不仅仅是单个键。

Predicates can take a form of DELETE FROM table WHERE key ≥ "k2" AND key < "k4" and can receive any range matchers. Apache Cassandra implements this approach and calls it range tombstones. A range tombstone covers a range of keys rather than just a single key.

使用范围墓碑时，由于范围可能重叠以及磁盘驻留表存在边界，必须仔细考虑解析规则。例如，以下组合将使与 k2 和 k3 关联的数据记录从最终结果中隐藏：

When using range tombstones, resolution rules have to be carefully considered because of overlapping ranges and disk-resident table boundaries. For example, the following combination will hide data records associated with k2 and k3 from the final result:

磁盘表 1          磁盘表 2
| k1 | v1 |       | k2 | <start_tombstone_inclusive> |
| k2 | v2 |       | k4 | <end_tombstone_exclusive>   |
| k3 | v3 |
| k4 | v4 |
Disk Table 1      Disk Table 2
| k1 | v1 |       | k2 | <start_tombstone_inclusive> |
| k2 | v2 |       | k4 | <end_tombstone_exclusive>   |
| k3 | v3 |
| k4 | v4 |
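The start-inclusive/end-exclusive resolution rule shown in the tables above can be sketched as a filter applied during reconciliation. This is a simplified illustration (real engines evaluate tombstones while merging sorted streams, not over whole in-memory tables); the function name is hypothetical.

```python
# Sketch of a range tombstone [start_inclusive, end_exclusive) hiding the
# records whose keys fall inside the range during reconciliation.

def apply_range_tombstone(records, start_inclusive, end_exclusive):
    """Drop records whose keys fall inside [start_inclusive, end_exclusive)."""
    return {k: v for k, v in records.items()
            if not (start_inclusive <= k < end_exclusive)}
```

Applied to Disk Table 1 with the tombstone from Disk Table 2, the records for k2 and k3 are hidden, while k1 and k4 survive (k4 is excluded from the range because the end bound is exclusive).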

LSM 树查找

LSM Tree Lookups

LSM树由多个组件组成。在查找过程中,通常会访问多个组件,因此必须合并并协调它们的内容,然后才能将其返回给客户端。为了更好地理解合并过程,让我们看看合并过程中表是如何迭代的以及冲突的记录是如何合并的。

LSM Trees consist of multiple components. During lookups, more than one component is usually accessed, so their contents have to be merged and reconciled before they can be returned to the client. To better understand the merge process, let’s see how tables are iterated during the merge and how conflicting records are combined.

合并迭代

Merge-Iteration

由于磁盘驻留表的内容已排序，我们可以使用多路归并排序算法。例如，我们有三个源：两个磁盘驻留表和一个 memtable。通常，存储引擎提供游标或迭代器来遍历文件内容。该游标保存最后消费的数据记录的偏移量，可以检查迭代是否已完成，并可用于检索下一条数据记录。

Since contents of disk-resident tables are sorted, we can use a multiway merge-sort algorithm. For example, we have three sources: two disk-resident tables and one memtable. Usually, storage engines offer a cursor or an iterator to navigate through file contents. This cursor holds the offset of the last consumed data record, can be checked for whether or not iteration has finished, and can be used to retrieve the next data record.

多路归并排序使用优先级队列，例如最小堆（min-heap）[SEDGEWICK11]，它最多可容纳 N 个元素（其中 N 是迭代器的数量），它对其内容进行排序并准备好下一个待返回的最小元素。每个迭代器的头部元素被放入队列中。队列头部的元素即是所有迭代器中的最小值。

A multiway merge-sort uses a priority queue, such as min-heap [SEDGEWICK11], that holds up to N elements (where N is the number of iterators), which sorts its contents and prepares the next-in-line smallest element to be returned. The head of each iterator is placed into the queue. An element in the head of the queue is then the minimum of all iterators.

注意

优先级队列是一种用于维护项目有序队列的数据结构。常规队列按添加顺序(先进先出)保留项目,而优先级队列会在插入时重新排序项目,并将具有最高(或最低)优先级的项目放置在队列的头部。这对于合并迭代特别有用,因为我们必须按排序顺序输出元素。

A priority queue is a data structure used for maintaining an ordered queue of items. While a regular queue retains items in order of their addition (first in, first out), a priority queue re-sorts items on insertion and the item with the highest (or lowest) priority is placed in the head of the queue. This is particularly useful for merge-iteration, since we have to output elements in a sorted order.

当从队列中删除最小元素时,将检查与其关联的迭代器是否有下一个值,然后将其放入队列中,并重新排序以保留顺序。

When the smallest element is removed from the queue, the iterator associated with it is checked for the next value, which is then placed into the queue, which is re-sorted to preserve the order.

由于所有迭代器内容都已排序,因此从保存所有迭代器头的先前最小值的迭代器中重新插入值也会保留一个不变量,即队列仍然保存所有迭代器中的最小元素。每当其中一个迭代器耗尽时,算法就会继续进行,而不会重新插入下一个迭代器头。该算法将继续下去,直到满足查询条件或用尽所有迭代器。

Since all iterator contents are sorted, reinserting a value from the iterator that held the previous smallest value of all iterator heads also preserves an invariant that the queue still holds the smallest elements from all iterators. Whenever one of the iterators is exhausted, the algorithm proceeds without reinserting the next iterator head. The algorithm continues until either query conditions are satisfied or all iterators are exhausted.

图 7-5显示了刚刚描述的合并过程的示意图:头元素(源表中的浅灰色项)被放置到优先级队列中。优先级队列中的元素将返回到输出迭代器。结果输出已排序。

Figure 7-5 shows a schematic representation of the merge process just described: head elements (light gray items in source tables) are placed to the priority queue. Elements from the priority queue are returned to the output iterator. The resulting output is sorted.

在合并迭代过程中,我们可能会遇到同一键的多个数据记录。从优先级队列和迭代器不变量中,我们知道,如果每个迭代器每个键只保存一条数据记录,并且最终队列中同一键有多个记录,那么这些数据记录一定来自不同的迭代器。

It may happen that we encounter more than one data record for the same key during merge-iteration. From the priority queue and iterator invariants, we know that if each iterator only holds a single data record per key, and we end up with multiple records for the same key in the queue, these data records must have come from the different iterators.

数据库0705
图 7-5。LSM 合并机制

让我们一步一步地看一个例子。作为输入数据,我们在两个磁盘驻留表上有迭代器:

Let’s follow through one example step-by-step. As input data, we have iterators over two disk-resident tables:

迭代器 1:            迭代器 2:
{k2: v1} {k4: v2}   {k1: v3} {k2: v4} {k3: v5}
Iterator 1:         Iterator 2:
{k2: v1} {k4: v2}   {k1: v3} {k2: v4} {k3: v5}

优先级队列从迭代器头开始填充:

The priority queue is filled from the iterator heads:

迭代器 1:            迭代器 2:            优先级队列:
{k4: v2}            {k2: v4} {k3: v5}   {k1: v3} {k2: v1}
Iterator 1:         Iterator 2:         Priority queue:
{k4: v2}            {k2: v4} {k3: v5}   {k1: v3} {k2: v1}

键 k1 是队列中最小的键，它被附加到结果中。由于它来自迭代器 2，我们从该迭代器重新填充队列：

Key k1 is the smallest key in the queue and is appended to the result. Since it came from Iterator 2, we refill the queue from it:

迭代器 1:            迭代器 2:            优先级队列:          合并结果:
{k4: v2}            {k3: v5}            {k2: v1} {k2: v4}    {k1: v3}
Iterator 1:         Iterator 2:         Priority queue:      Merged Result:
{k4: v2}            {k3: v5}            {k2: v1} {k2: v4}    {k1: v3}

现在，队列中有两条键为 k2 的记录。由于上述不变量，我们可以确定任何迭代器中都不存在具有相同键的其他记录。具有相同键的记录被合并并附加到合并结果中。

Now, we have two records for the k2 key in the queue. We can be sure there are no other records with the same key in any iterator because of the aforementioned invariants. Same-key records are merged and appended to the merged result.

队列用来自两个迭代器的数据重新填充：

The queue is refilled with data from both iterators:

迭代器 1:            迭代器 2:            优先级队列:          合并结果:
{}                  {}                  {k3: v5} {k4: v2}    {k1: v3} {k2: v4}
Iterator 1:         Iterator 2:         Priority queue:      Merged Result:
{}                  {}                  {k3: v5} {k4: v2}    {k1: v3} {k2: v4}

由于所有迭代器现在都是空的,我们将剩余的队列内容附加到输出中:

Since all iterators are now empty, we append the remaining queue contents to the output:

合并结果:
  {k1: v3} {k2: v4} {k3: v5} {k4: v2}
Merged Result:
  {k1: v3} {k2: v4} {k3: v5} {k4: v2}

总之,必须重复以下步骤来创建组合迭代器:

In summary, the following steps have to be repeated to create a combined iterator:

  1. 最初，用每个迭代器的第一个项目填充队列。

  1. Initially, fill the queue with the first items from each iterator.

  2. 从队列中取出最小的元素（头部）。

  2. Take the smallest element (head) from the queue.

  3. 从相应的迭代器重新填充队列，除非该迭代器已耗尽。

  3. Refill the queue from the corresponding iterator, unless this iterator is exhausted.
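
The steps above can be sketched with Python's `heapq` as the priority queue. Each source is a sorted iterator of `(key, value)` pairs; same-key reconciliation is deliberately left out here (see "Reconciliation"). This is an illustrative sketch, not any specific engine's implementation.

```python
# Multiway merge-iteration over sorted sources using a min-heap as the
# priority queue. Heap entries carry the source index so the queue can be
# refilled from the iterator that produced the popped head.

import heapq

def merge_iterators(*sorted_sources):
    iterators = [iter(src) for src in sorted_sources]
    heap = []
    # 1. fill the queue with the head of each iterator
    for idx, it in enumerate(iterators):
        head = next(it, None)
        if head is not None:
            heapq.heappush(heap, (head, idx))
    while heap:
        # 2. take the smallest element (head) from the queue
        item, idx = heapq.heappop(heap)
        yield item
        # 3. refill from the same iterator, unless it is exhausted
        nxt = next(iterators[idx], None)
        if nxt is not None:
            heapq.heappush(heap, (nxt, idx))
```

Running it over the two tables from the worked example yields the keys in sorted order, with both `k2` versions adjacent and ready for reconciliation.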

就复杂性而言，合并迭代器与合并已排序集合相同。它有 O(N) 的内存开销，其中 N 是迭代器的数量。迭代器头部的有序集合以 O(log N)（平均情况）维护[KNUTH98]。

In terms of complexity, merging iterators is the same as merging sorted collections. It has O(N) memory overhead, where N is the number of iterators. A sorted collection of iterator heads is maintained with O(log N) (average case) [KNUTH98].

协调

Reconciliation

合并迭代只是合并多个来源的数据所需工作的一个方面。另一个重要方面是对与同一键关联的数据记录进行协调和冲突解决。

Merge-iteration is just a single aspect of what has to be done to merge data from multiple sources. Another important aspect is reconciliation and conflict resolution of the data records associated with the same key.

不同的表可能保存相同键的数据记录,例如更新和删除,并且它们的内容必须协调。前面示例中的优先级队列实现必须能够允许与同一键关联的多个值并触发协调。

Different tables might hold data records for the same key, such as updates and deletes, and their contents have to be reconciled. The priority queue implementation from the preceding example must be able to allow multiple values associated with the same key and trigger reconciliation.

注意

如果记录不存在则将其插入数据库、否则更新现有记录的操作称为更新插入（upsert）。在 LSM 树中，插入和更新操作是无法区分的，因为它们不会尝试在所有源中定位先前与该键关联的数据记录并重新赋值，因此我们可以说默认情况下我们是在更新插入记录。

An operation that inserts the record to the database if it does not exist, and updates an existing one otherwise, is called an upsert. In LSM Trees, insert and update operations are indistinguishable, since they do not attempt to locate data records previously associated with the key in all sources and reassign its value, so we can say that we upsert records by default.

为了协调数据记录,我们需要了解其中哪一项优先。数据记录保存为此所需的元数据,例如时间戳。为了建立来自多个来源的项目之间的顺序并找出哪一个较新,我们可以比较它们的时间戳。

To reconcile data records, we need to understand which one of them takes precedence. Data records hold metadata necessary for this, such as timestamps. To establish the order between the items coming from multiple sources and find out which one is more recent, we can compare their timestamps.

具有较高时间戳的记录所遮蔽的记录不会返回给客户端或在压缩期间写入。

Records shadowed by the records with higher timestamps are not returned to the client or written during compaction.
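Timestamp-based reconciliation can be sketched in a few lines. The `(value, timestamp)` record layout is an assumption made for the sketch; real engines attach richer metadata, and tie-breaking rules vary between implementations.

```python
# Sketch of last-write-wins reconciliation: among versions of the same key
# collected from multiple sources, the record with the highest timestamp wins.

def reconcile_versions(versions):
    """versions: list of (value, timestamp) pairs; return the winning value."""
    return max(versions, key=lambda v: v[1])[0]
```

The shadowed versions (those with lower timestamps) are simply dropped: they are neither returned to the client nor written out during compaction.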

LSM 树的维护

Maintenance in LSM Trees

与可变 B 树类似，LSM 树也需要维护。这些维护过程的性质很大程度上受这些算法所保持的不变量的影响。

Similar to mutable B-Trees, LSM Trees require maintenance. The nature of these processes is heavily influenced by the invariants these algorithms preserve.

在 B 树中,维护过程收集未引用的单元格并对页面进行碎片整理,回收已删除和隐藏记录所占用的空间。在LSM Trees中,磁盘驻留表的数量不断增长,但可以通过触发定期压缩来减少。

In B-Trees, the maintenance process collects unreferenced cells and defragments the pages, reclaiming the space occupied by removed and shadowed records. In LSM Trees, the number of disk-resident tables is constantly growing, but can be reduced by triggering periodic compaction.

压缩选择多个磁盘驻留表,使用上述合并和协调算法迭代其整个内容,并将结果写入新创建的表中。

Compaction picks multiple disk-resident tables, iterates over their entire contents using the aforementioned merge and reconciliation algorithms, and writes out the results into the newly created table.

由于磁盘驻留表内容已排序,并且由于合并排序的工作方式,压缩具有理论上的内存使用上限,因为它应该仅将迭代器头保存在内存中。所有表内容都按顺序消费,并且生成的合并数据也按顺序写出。由于额外的优化,这些细节可能因实现而异。

Since disk-resident table contents are sorted, and because of the way merge-sort works, compaction has a theoretical memory usage upper bound, since it should only hold iterator heads in memory. All table contents are consumed sequentially, and the resulting merged data is also written out sequentially. These details may vary between implementations due to additional optimizations.

压缩表在压缩过程完成之前一直可供读取,这意味着在压缩期间,磁盘上需要有足够的可用空间来写入压缩表。

Compacting tables remain available for reads until the compaction process finishes, which means that for the duration of compaction, it is required to have enough free space available on disk for a compacted table to be written.

在任何给定时间,系统中都可以执行多次压缩。然而,这些并发压缩通常适用于不相交的表集。压缩编写器既可以将多个表合并为一个表,也可以将一个表分区为多个表。

At any given time, multiple compactions can be executed in the system. However, these concurrent compactions usually work on nonintersecting sets of tables. A compaction writer can both merge several tables into one and partition one table into multiple tables.

平整压实

Leveled compaction

压缩为优化提供了多种机会，并且存在许多不同的压缩策略。其中一种经常被实现的压缩策略称为分层压缩（leveled compaction），例如 RocksDB 就使用了它。

Compaction opens up multiple opportunities for optimizations, and there are many different compaction strategies. One of the frequently implemented compaction strategies is called leveled compaction. For example, it is used by RocksDB.

分层压缩将磁盘驻留表划分为多个级别。每个级别的表都有目标大小，并且每个级别都有相应的索引号（标识符）。有点违反直觉的是，索引最高的级别称为最底层。为了清楚起见，本节避免使用术语“更高级别”和“更低级别”，并对级别索引使用相同的限定词。也就是说，由于 2 大于 1，因此级别 2 的索引高于级别 1。术语“上一个”和“下一个”与级别索引具有相同的顺序语义。

Leveled compaction separates disk-resident tables into levels. Tables on each level have target sizes, and each level has a corresponding index number (identifier). Somewhat counterintuitively, the level with the highest index is called the bottommost level. For clarity, this section avoids using terms higher and lower level and uses the same qualifiers for level index. That is, since 2 is larger than 1, level 2 has a higher index than level 1. The terms previous and next have the same order semantics as level indexes.

0 级表是通过刷新内存表内容来创建的。0 级表可能包含重叠的键范围。一旦级别 0 上的表数量达到阈值,它们的内容就会合并,为级别 1 创建新表。

Level-0 tables are created by flushing memtable contents. Tables in level 0 may contain overlapping key ranges. As soon as the number of tables on level 0 reaches a threshold, their contents are merged, creating new tables for level 1.

1 级表和具有较高索引的所有级别的表的键范围不重叠,因此 0 级表必须在压缩过程中进行分区,拆分为范围,并与包含相应键范围的表合并。或者,压缩可以包括所有0 级和 1 级表,并输出分区的 1 级表。

Key ranges for the tables on level 1 and all levels with a higher index do not overlap, so level-0 tables have to be partitioned during compaction, split into ranges, and merged with tables holding corresponding key ranges. Alternatively, compaction can include all level-0 and level-1 tables, and output partitioned level-1 tables.

对具有较高索引的级别进行压缩会从具有重叠范围的两个连续级别中选取表,并在较高级别上生成一个新表。图7-6示意性地显示了压缩过程如何在级别之间迁移数据。压缩 1 级和 2 级表的过程将在 2 级上生成一个新表。根据表的分区方式,可以选择来自一个级别的多个表进行压缩。

Compactions on the levels with the higher indexes pick tables from two consecutive levels with overlapping ranges and produce a new table on a higher level. Figure 7-6 schematically shows how the compaction process migrates data between the levels. The process of compacting level-1 and level-2 tables will produce a new table on level 2. Depending on how tables are partitioned, multiple tables from one level can be picked for compaction.

数据库0706
图 7-6。压缩过程。带虚线的灰色框表示当前正在压缩的表。横跨级别的框表示该级别上的目标数据大小限制。级别 1 已超出限制。

在不同的表中保留不同的键范围可以减少读取期间访问的表的数量。这是通过检查表元数据并过滤掉范围不包含搜索键的表来完成的。

Keeping different key ranges in the distinct tables reduces the number of tables accessed during the read. This is done by inspecting the table metadata and filtering out the tables whose ranges do not contain a searched key.

Each level has a limit on the table size and the maximum number of tables. As soon as the number of tables on level 1 or any level with a higher index reaches a threshold, tables from the current level are merged with tables on the next level holding the overlapping key range.

Sizes grow exponentially between the levels: tables on each next level are exponentially larger than tables on the previous one. This way, the freshest data is always on the level with the lowest index, and older data gradually migrates to the higher ones.
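To make the growth concrete, here is a small sketch; the 10 MB base size and fanout of 10 are illustrative, LevelDB-style assumptions rather than universal constants:

```python
# Hypothetical sketch of per-level target sizes in leveled compaction.
# The base size and fanout are illustrative assumptions, not constants
# prescribed by any particular engine.

def level_target_sizes(base_bytes, fanout, levels):
    """Target size for levels 1..levels: base * fanout**(level - 1).

    Level 0 is excluded: its tables come straight from memtable
    flushes and are bounded by a table count, not a byte size.
    """
    return [base_bytes * fanout ** (level - 1) for level in range(1, levels + 1)]

# With a 10 MB base and a fanout of 10, levels 1..4 target
# 10 MB, 100 MB, 1,000 MB, and 10,000 MB.
sizes = level_target_sizes(10 * 1024 ** 2, 10, 4)
```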

Size-tiered compaction

Another popular compaction strategy is called size-tiered compaction. In size-tiered compaction, rather than grouping disk-resident tables based on their level, they’re grouped by size: smaller tables are grouped with smaller ones, and bigger tables are grouped with bigger ones.

Level 0 holds the smallest tables that were either flushed from memtables or created by the compaction process. When the tables are compacted, the resulting merged table is written to the level holding tables with corresponding sizes. The process continues recursively incrementing levels, compacting and promoting larger tables to higher levels, and demoting smaller tables to lower levels.
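A minimal sketch of the grouping step; the power-of-growth-factor bucketing rule and the thresholds are illustrative assumptions, not taken from any particular engine:

```python
# Hypothetical sketch of size-tiered grouping: tables are bucketed by
# size so that similarly sized tables get compacted together. The
# bucketing rule (powers of a growth factor) is an illustrative choice.

from collections import defaultdict
import math

def size_tiers(table_sizes, growth_factor=4, min_threshold=4):
    """Group table sizes into tiers; return only the tiers that have
    accumulated enough tables to be worth compacting."""
    buckets = defaultdict(list)
    for size in table_sizes:
        tier = int(math.log(max(size, 1), growth_factor))
        buckets[tier].append(size)
    return {tier: sizes for tier, sizes in buckets.items()
            if len(sizes) >= min_threshold}
```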

Warning

One of the problems with size-tiered compaction is called table starvation: if compacted tables are still small enough after compaction (e.g., records were shadowed by the tombstones and did not make it to the merged table), higher levels may get starved of compaction and their tombstones will not be taken into consideration, increasing the cost of reads. In this case, compaction has to be forced for a level, even if it doesn’t contain enough tables.

There are other commonly implemented compaction strategies that might optimize for different workloads. For example, Apache Cassandra also implements a time window compaction strategy, which is particularly useful for time-series workloads with records for which time-to-live is set (in other words, items have to be expired after a given time period).

The time window compaction strategy takes write timestamps into consideration and allows dropping entire files that hold data for an already expired time range without requiring us to compact and rewrite their contents.

Read, Write, and Space Amplification

When implementing an optimal compaction strategy, we have to take multiple factors into consideration. One approach is to reclaim space occupied by duplicate records and reduce space overhead, which results in higher write amplification caused by re-writing tables continuously. The alternative is to avoid rewriting the data continuously, which increases read amplification (overhead from reconciling data records associated with the same key during the read), and space amplification (since redundant records are preserved for a longer time).

Note

One of the big disputes in the database community is whether B-Trees or LSM Trees have lower write amplification. It is extremely important to understand the source of write amplification in both cases. In B-Trees, it comes from writeback operations and subsequent updates to the same node. In LSM Trees, write amplification is caused by migrating data from one file to the other during compaction. Comparing the two directly may lead to incorrect assumptions.

In summary, when storing data on disk in an immutable fashion, we face three problems:

Read amplification

Resulting from a need to address multiple tables to retrieve data.

Write amplification

Caused by continuous rewrites by the compaction process.

Space amplification

Arising from storing multiple records associated with the same key.

We’ll be addressing each one of these throughout the rest of the chapter.

RUM Conjecture

One of the popular cost models for storage structures takes three factors into consideration: Read, Update, and Memory overheads. It is called RUM Conjecture [ATHANASSOULIS16].

RUM Conjecture states that reducing two of these overheads inevitably leads to change for the worse in the third one, and that optimizations can be done only at the expense of one of the three parameters. We can compare different storage engines in terms of these three parameters to understand which ones they optimize for, and which potential trade-offs this may imply.

An ideal solution would provide the lowest read cost while maintaining low memory and write overheads, but in reality, this is not achievable, and we are presented with a trade-off.

B-Trees are read-optimized. Writes to the B-Tree require locating a record on disk, and subsequent writes to the same page might have to update the page on disk multiple times. Reserved extra space for future updates and deletes increases space overhead.

LSM Trees do not require locating the record on disk during write and do not reserve extra space for future writes. There is still some space overhead resulting from storing redundant records. In a default configuration, reads are more expensive, since multiple tables have to be accessed to return complete results. However, optimizations we discuss in this chapter help to mitigate this problem.

As we’ve seen in the chapters about B-Trees, and will see in this chapter, there are ways to improve these characteristics by applying different optimizations.

This cost model is not perfect, as it does not take into account other important metrics such as latency, access patterns, implementation complexity, maintenance overhead, and hardware-related specifics. Higher-level concepts important for distributed databases, such as consistency implications and replication overhead, are also not considered. However, this model can be used as a first approximation and a rule of thumb as it helps understand what the storage engine has to offer.

Implementation Details

We’ve covered the basic dynamics of LSM Trees: how data is read, written, and compacted. However, there are some other things that many LSM Tree implementations have in common that are worth discussing: how memory- and disk-resident tables are implemented, how secondary indexes work, how to reduce the number of disk-resident tables accessed during read and, finally, new ideas related to log-structured storage.

Sorted String Tables

So far we’ve discussed the hierarchical and logical structure of LSM Trees (that they consist of multiple memory- and disk-resident components), but have not yet discussed how disk-resident tables are implemented and how their design plays together with the rest of the system.

Disk-resident tables are often implemented using Sorted String Tables (SSTables). As the name suggests, data records in SSTables are sorted and laid out in key order. SSTables usually consist of two components: index files and data files. Index files are implemented using some structure allowing logarithmic lookups, such as B-Trees, or constant-time lookups, such as hashtables.

Since data files hold records in key order, using hashtables for indexing does not prevent us from implementing range scans, as a hashtable is only accessed to locate the first key in the range, and the range itself can be read from the data file sequentially while the range predicate still matches.

The index component holds keys and data entries (offsets in the data file where the actual data records are located). The data component consists of concatenated key-value pairs. The cell design and data record formats we discussed in Chapter 3 are largely applicable to SSTables. The main difference here is that cells are written sequentially and are not modified during the life cycle of the SSTable. Since the index files hold pointers to the data records stored in the data file, their offsets have to be known by the time the index is created.
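The two-component layout can be sketched as follows; the in-memory lists standing in for the index and data files, and the class name, are assumptions made purely for illustration:

```python
# Toy SSTable sketch: a data component holding records in key order,
# plus an index component mapping each key to its position. Here the
# "files" are in-memory lists; a real SSTable stores byte offsets.

class SSTable:
    def __init__(self, records):
        self.data = []     # data component: (key, value) pairs in key order
        self.index = {}    # index component: key -> position in data
        for pos, (key, value) in enumerate(sorted(records.items())):
            self.index[key] = pos
            self.data.append((key, value))

    def get(self, key):
        pos = self.index.get(key)
        return None if pos is None else self.data[pos][1]

    def range_scan(self, start, end):
        # Even a hashtable index supports scans: it only locates the
        # first key, then the data component is read sequentially while
        # the range predicate still matches. This sketch assumes the
        # start key is present in the table.
        pos = self.index.get(start)
        out = []
        while pos is not None and pos < len(self.data) and self.data[pos][0] <= end:
            out.append(self.data[pos])
            pos += 1
        return out
```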

During compaction, data files can be read sequentially without addressing the index component, as data records in them are already ordered. Since tables merged during compaction have the same order, and merge-iteration is order-preserving, the resulting merged table is also created by writing data records sequentially in a single run. As soon as the file is fully written, it is considered immutable, and its disk-resident contents are not modified.
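The order-preserving merge can be sketched with a heap-based k-way merge; treating lower list indexes as newer tables, so that the newest record for a key shadows older ones, is an assumption of this sketch:

```python
# Sketch of the order-preserving merge used during compaction: inputs
# are already sorted, so the merged output is produced sequentially in
# a single pass. For duplicate keys, the record from the newer table
# (lower index in `tables`, by assumption) wins.

import heapq

def merge_tables(tables):
    """tables: list of sorted (key, value) lists, newest first."""
    merged = []
    heap = []
    for age, table in enumerate(tables):
        if table:
            heapq.heappush(heap, (table[0][0], age, 0))
    while heap:
        key, age, i = heapq.heappop(heap)
        if not merged or merged[-1][0] != key:   # first (newest) record wins
            merged.append(tables[age][i])
        if i + 1 < len(tables[age]):
            nxt = tables[age][i + 1]
            heapq.heappush(heap, (nxt[0], age, i + 1))
    return merged
```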

Bloom Filters

The source of read amplification in LSM Trees is that we have to address multiple disk-resident tables for the read operation to complete. This happens because we do not always know up front whether or not a disk-resident table contains a data record for the searched key.

One of the ways to prevent table lookup is to store its key range (smallest and largest keys stored in the given table) in metadata, and check if the searched key belongs to the range of that table. This information is imprecise and can only tell us if the data record can be present in the table. To improve this situation, many implementations, including Apache Cassandra and RocksDB, use a data structure called a Bloom filter.

Note

Probabilistic data structures are generally more space efficient than their “regular” counterparts. For example, to check set membership, cardinality (find out the number of distinct elements in a set), or frequency (find out how many times a certain element has been encountered), we would have to store all set elements and go through the entire dataset to find the result. Probabilistic structures allow us to store approximate information and perform queries that yield results with an element of uncertainty. Some commonly known examples of such data structures are a Bloom filter (for set membership), HyperLogLog (for cardinality estimation) [FLAJOLET12], and Count-Min Sketch (for frequency estimation) [CORMODE12].

A Bloom filter, conceived by Burton Howard Bloom in 1970 [BLOOM70], is a space-efficient probabilistic data structure that can be used to test whether the element is a member of the set or not. It can produce false-positive matches (say that the element is a member of the set, while it is not present there), but cannot produce false negatives (if a negative match is returned, the element is guaranteed not to be a member of the set).

In other words, a Bloom filter can be used to tell if the key might be in the table or is definitely not in the table. Files for which a Bloom filter returns a negative match are skipped during the query. The rest of the files are accessed to find out if the data record is actually present. Using Bloom filters associated with disk-resident tables helps to significantly reduce the number of tables accessed during a read.

A Bloom filter uses a large bit array and multiple hash functions. Hash functions are applied to keys of the records in the table to find indices in the bit array, bits for which are set to 1. Bits set to 1 in all positions determined by the hash functions indicate a presence of the key in the set. During lookup, when checking for element presence in a Bloom filter, hash functions are calculated for the key again and, if bits determined by all hash functions are 1, we return the positive result stating that item is a member of the set with a certain probability. If at least one of the bits is 0, we can precisely say that element is not present in the set.

Hash functions applied to different keys can return the same bit position and result in a hash collision, and 1 bits only imply that some hash function has yielded this bit position for some key.

Probability of false positives is managed by configuring the size of the bit set and the number of hash functions: in a larger bit set, there’s a smaller chance of collision; similarly, having more hash functions, we can check more bits and have a more precise outcome.

The larger bit set occupies more memory, and computing results of more hash functions may have a negative performance impact, so we have to find a reasonable middle ground between acceptable probability and incurred overhead. Probability can be calculated from the expected set size. Since tables in LSM Trees are immutable, set size (number of keys in the table) is known up front.

Let’s take a look at a simple example, shown in Figure 7-7. We have a 16-way bit array and 3 hash functions, which yield values 3, 5, and 10 for key1. We now set bits at these positions. The next key is added and hash functions yield values of 5, 8, and 14 for key2, for which we set bits, too.

Figure 7-7. Bloom filter

Now, we’re trying to check whether or not key3 is present in the set, and hash functions yield 3, 10, and 14. Since all three bits were set when adding key1 and key2, we have a situation in which the Bloom filter returns a false positive: key3 was never appended there, yet all of the calculated bits are set. However, since the Bloom filter only claims that element might be in the table, this result is acceptable.

If we try to perform a lookup for key4 and receive values of 5, 9, and 15, we find that only bit 5 is set, and the other two bits are unset. If even one of the bits is unset, we know for sure that the element was never appended to the filter.
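A minimal Bloom filter along the lines described above might look like this; the double-hashing trick (deriving all k probe positions from two hash values) is a common implementation shortcut assumed here, not something the description above prescribes:

```python
# Minimal Bloom filter sketch. Probe positions are derived from two
# 64-bit halves of a SHA-256 digest (double hashing), an assumption of
# this sketch rather than a required construction.

import hashlib

class BloomFilter:
    def __init__(self, size_bits, num_hashes):
        self.m = size_bits
        self.k = num_hashes
        self.bits = 0            # bit array, stored as a big integer

    def _positions(self, key):
        digest = hashlib.sha256(key.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all(self.bits >> pos & 1 for pos in self._positions(key))
```

A negative answer is authoritative; a positive one only means the key may be present. For n expected keys, the false positive rate is approximately (1 − e^(−kn/m))^k, which is how implementations can size m and k up front from the known table size.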

Skiplist

There are many different data structures for keeping sorted data in memory, and one that has been getting more popular recently because of its simplicity is called a skiplist [PUGH90b]. Implementation-wise, a skiplist is not much more complex than a singly-linked list, and its probabilistic complexity guarantees are close to those of search trees.

Skiplists do not require rotation or relocation for inserts and updates, and use probabilistic balancing instead. Skiplists are generally less cache-friendly than in-memory B-Trees, since skiplist nodes are small and randomly allocated in memory. Some implementations improve the situation by using unrolled linked lists.

A skiplist consists of a series of nodes of a different height, building linked hierarchies allowing to skip ranges of items. Each node holds a key, and, unlike the nodes in a linked list, some nodes have more than just one successor. A node of height h is linked from one or more predecessor nodes of a height up to h. Nodes on the lowest level can be linked from nodes of any height.

Node height is determined by a random function and is computed during insert. Nodes that have the same height form a level. The number of levels is capped to avoid infinite growth, and a maximum height is chosen based on how many items can be held by the structure. There are exponentially fewer nodes on each next level.

Lookups work by following the node pointers on the highest level. As soon as the search encounters the node that holds a key that is greater than the searched one, its predecessor’s link to the node on the next level is followed. In other words, if the searched key is greater than the current node key, the search continues forward. If the searched key is smaller than the current node key, the search continues from the predecessor node on the next level. This process is repeated recursively until the searched key or its predecessor is located.

For example, searching for key 7 in the skiplist shown in Figure 7-8 can be done as follows:

  1. Follow the pointer on the highest level, to the node that holds key 10.

  2. Since the searched key 7 is smaller than 10, the next-level pointer from the head node is followed, locating a node holding key 5.

  3. The highest-level pointer on this node is followed, locating the node holding key 10 again.

  4. The searched key 7 is smaller than 10, and the next-level pointer from the node holding key 5 is followed, locating a node holding the searched key 7.

Figure 7-8. Skiplist

During insert, an insertion point (node holding a key or its predecessor) is found using the aforementioned algorithm, and a new node is created. To build a tree-like hierarchy and keep balance, the height of the node is determined using a random number, generated based on a probability distribution. Pointers in predecessor nodes holding keys smaller than the key in a newly created node are linked to point to that node. Their higher-level pointers remain intact. Pointers in the newly created node are linked to corresponding successors on each level.

During delete, forward pointers of the removed node are placed to predecessor nodes on corresponding levels.
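The lookup and insert procedures above can be sketched compactly; MAX_LEVEL and the promotion probability P are tunable assumptions of this sketch:

```python
# Compact skiplist sketch: node height is chosen probabilistically,
# search descends from the top level, and insert splices the new node
# into every level it participates in.

import random

MAX_LEVEL = 16   # cap on node height (assumption)
P = 0.5          # promotion probability (assumption)

class Node:
    def __init__(self, key, height):
        self.key = key
        self.next = [None] * height

class Skiplist:
    def __init__(self):
        self.head = Node(None, MAX_LEVEL)   # sentinel, never compared

    def _random_height(self):
        h = 1
        while h < MAX_LEVEL and random.random() < P:
            h += 1
        return h

    def search(self, key):
        node = self.head
        for level in reversed(range(MAX_LEVEL)):
            # Move forward while the next key is still smaller,
            # then drop down one level.
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
        node = node.next[0]
        return node is not None and node.key == key

    def insert(self, key):
        # Record the rightmost predecessor on every level...
        update = [self.head] * MAX_LEVEL
        node = self.head
        for level in reversed(range(MAX_LEVEL)):
            while node.next[level] and node.next[level].key < key:
                node = node.next[level]
            update[level] = node
        # ...then splice the new node in on each of its levels.
        new = Node(key, self._random_height())
        for level in range(len(new.next)):
            new.next[level] = update[level].next[level]
            update[level].next[level] = new
```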

We can create a concurrent version of a skiplist by implementing a linearizability scheme that uses an additional fully_linked flag that determines whether or not the node pointers are fully updated. This flag can be set using compare-and-swap [HERLIHY10]. This is required because the node pointers have to be updated on multiple levels to fully restore the skiplist structure.

In languages with an unmanaged memory model, reference counting or hazard pointers can be used to ensure that currently referenced nodes are not freed while they are accessed concurrently [RUSSEL12]. This algorithm is deadlock-free, since nodes are always accessed from higher levels.

Apache Cassandra uses skiplists for the secondary index memtable implementation. WiredTiger uses skiplists for some in-memory operations.

Disk Access

Since most of the table contents are disk-resident, and storage devices generally allow accessing data blockwise, many LSM Tree implementations rely on the page cache for disk accesses and intermediate caching. Many techniques described in “Buffer Management”, such as page eviction and pinning, still apply to log-structured storage.

The most notable difference is that in-memory contents are immutable and therefore require no additional locks or latches for concurrent access. Reference counting is applied to make sure that currently accessed pages are not evicted from memory, and in-flight requests complete before underlying files are removed during compaction.

Another difference is that data records in LSM Trees are not necessarily page aligned, and pointers can be implemented using absolute offsets rather than page IDs for addressing. In Figure 7-9, you can see records with contents that are not aligned with disk blocks. Some records cross the page boundaries and require loading several pages in memory.

Figure 7-9. Unaligned data records

Compression

We’ve already discussed compression in the context of B-Trees (see “Compression”). Similar ideas are also applicable to LSM Trees. The main difference here is that LSM Tree tables are immutable, and are generally written in a single pass. When compressing data page-wise, compressed pages are not page aligned, as their sizes are smaller than those of uncompressed ones.

To be able to address compressed pages, we need to keep track of the address boundaries when writing their contents. We could fill compressed pages with zeros, aligning them to the page size, but then we’d lose the benefits of compression.

To make compressed pages addressable, we need an indirection layer which stores offsets and sizes of compressed pages. Figure 7-10 shows the mapping between compressed and uncompressed blocks. Compressed pages are always smaller than the originals, since otherwise there’s no point in compressing them.

Figure 7-10. Reading compressed blocks. Dotted lines represent pointers from the mapping table to the offsets of compressed pages on disk. Uncompressed pages usually reside in the page cache.

During compaction and flush, compressed pages are appended sequentially, and compression information (the original uncompressed page offset and the actual compressed page offset) is stored in a separate file segment. During the read, the compressed page offset and its size are looked up, and the page can be uncompressed and materialized in memory.
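The indirection layer can be sketched as a mapping from page id to (offset, size); the use of zlib and an in-memory blob standing in for the file segment are assumptions of this sketch:

```python
# Sketch of the indirection layer for compressed pages: pages are
# compressed and appended sequentially, and a mapping table records
# each page's (offset, size) so it can be located and uncompressed
# later. The in-memory blob stands in for the on-disk file.

import zlib

class CompressedPageFile:
    def __init__(self):
        self.blob = b""
        self.mapping = []    # page id -> (offset, compressed size)

    def append_page(self, page_bytes):
        compressed = zlib.compress(page_bytes)
        self.mapping.append((len(self.blob), len(compressed)))
        self.blob += compressed

    def read_page(self, page_id):
        offset, size = self.mapping[page_id]
        return zlib.decompress(self.blob[offset:offset + size])
```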

Unordered LSM Storage

Most of the storage structures discussed so far store data in order. Mutable and immutable B-Tree pages, sorted runs in FD-Trees, and SSTables in LSM Trees store data records in key order. The order in these structures is preserved differently: B-Tree pages are updated in place, FD-Tree runs are created by merging contents of two runs, and SSTables are created by buffering and sorting data records in memory.

In this section, we discuss structures that store records in random order. Unordered stores generally do not require a separate log and allow us to reduce the cost of writes by storing data records in insertion order.

Bitcask

Bitcask, one of the storage engines used in Riak, is an unordered log-structured storage engine [SHEEHY10b]. Unlike the log-structured storage implementations discussed so far, it does not use memtables for buffering, and stores data records directly in logfiles.

To make values searchable, Bitcask uses a data structure called keydir, which holds references to the latest data records for the corresponding keys. Old data records may still be present on disk, but are not referenced from keydir, and are garbage-collected during compaction. Keydir is implemented as an in-memory hashmap and has to be rebuilt from the logfiles during startup.

During a write, a key and a data record are appended to the logfile sequentially, and the pointer to the newly written data record location is placed in keydir.

Reads check the keydir to locate the searched key and follow the associated pointer to the logfile, locating the data record. Since at any given moment there can be only one value associated with the key in the keydir, point queries do not have to merge data from multiple sources.
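A toy version of this design, with an in-memory byte buffer standing in for the logfile and a length-prefixed record encoding chosen purely for illustration:

```python
# Toy Bitcask-style store: records are appended to a log, and the
# in-memory keydir maps each key to the offset of its latest record.
# The encoding (two 4-byte big-endian lengths, then key, then value)
# is an illustrative assumption.

import struct

class TinyBitcask:
    def __init__(self):
        self.log = bytearray()
        self.keydir = {}     # key -> offset of the latest record

    def put(self, key, value):
        k, v = key.encode(), value.encode()
        self.keydir[key] = len(self.log)   # point keydir at the new record
        self.log += struct.pack(">II", len(k), len(v)) + k + v

    def get(self, key):
        offset = self.keydir.get(key)
        if offset is None:
            return None
        klen, vlen = struct.unpack_from(">II", self.log, offset)
        start = offset + 8 + klen
        return self.log[start:start + vlen].decode()
```

Note that older records for a key remain in the log after an overwrite; only the keydir pointer moves, matching the shadowed records shown in gray in Figure 7-11.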

Figure 7-11 shows mapping between the keys and records in data files in Bitcask. Logfiles hold data records, and keydir points to the latest live data record associated with each key. Shadowed records in data files (ones that were superseded by later writes or deletes) are shown in gray.

Figure 7-11. Mapping between keydir and data files in Bitcask. Solid lines represent pointers from keys to the latest values associated with them. Shadowed key/value pairs are shown in light gray.

During compaction, contents of all logfiles are read sequentially, merged, and written to a new location, preserving only live data records and discarding the shadowed ones. Keydir is updated with new pointers to relocated data records.

Data records are stored directly in logfiles, so a separate write-ahead log doesn’t have to be maintained, which reduces both space overhead and write amplification. A downside of this approach is that it offers only point queries and doesn’t allow range scans, since items are unordered both in keydir and in data files.

Advantages of this approach are simplicity and great point query performance. Even though multiple versions of data records exist, only the latest one is addressed by keydir. However, having to keep all keys in memory and rebuilding keydir on startup are limitations that might be a deal breaker for some use cases. While this approach is great for point queries, it does not offer any support for range queries.

WiscKey

Range queries are important for many applications, and it would be great to have a storage structure that could have the write and space advantages of unordered storage, while still allowing us to perform range scans.

WiscKey [LU16] decouples sorting from garbage collection by keeping the keys sorted in LSM Trees, and keeping data records in unordered append-only files called vLogs (value logs). This approach can solve two problems mentioned while discussing Bitcask: a need to keep all keys in memory and to rebuild a hashtable on startup.

Figure 7-12 shows key components of WiscKey, and mapping between keys and log files. vLog files hold unordered data records. Keys are stored in sorted LSM Trees, pointing to the latest data records in the logfiles.

Since keys are typically much smaller than the data records associated with them, compacting them is significantly more efficient. This approach can be particularly useful for use cases with a low rate of updates and deletes, where garbage collection won’t free up as much disk space.

The main challenge here is that because vLog data is unsorted, range scans require random I/O. WiscKey uses internal SSD parallelism to prefetch blocks in parallel during range scans and reduce random I/O costs. In terms of block transfers, the costs are still high: to fetch a single data record during the range scan, the entire page where it is located has to be read.
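The key/value separation can be sketched as follows, with a sorted in-memory index standing in for the key LSM Tree; the class name and structures are assumptions of this sketch:

```python
# Sketch of WiscKey-style key/value separation: keys live in a sorted
# index (standing in for the LSM Tree), values in an unordered
# append-only vLog. Range scans walk the sorted keys and follow each
# pointer into the vLog — the source of random I/O discussed above.

import bisect

class WiscKeySketch:
    def __init__(self):
        self.vlog = []     # append-only value log, in arrival order
        self.keys = []     # sorted keys (stand-in for the key LSM Tree)
        self.ptrs = {}     # key -> position of latest value in vlog

    def put(self, key, value):
        self.vlog.append(value)            # sequential append
        if key not in self.ptrs:
            bisect.insort(self.keys, key)  # keep keys sorted
        self.ptrs[key] = len(self.vlog) - 1

    def range_scan(self, start, end):
        lo = bisect.bisect_left(self.keys, start)
        hi = bisect.bisect_right(self.keys, end)
        # Each value fetch is a random access into the unordered vLog.
        return [(k, self.vlog[self.ptrs[k]]) for k in self.keys[lo:hi]]
```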

Figure 7-12. Key components of WiscKey: the index LSM Tree and vLog files, and relationships between them. Shadowed records in data files (ones that were superseded by later writes or deletes) are shown in gray. Solid lines represent pointers from keys in the LSM Tree to the latest values in logfiles.

During compaction, vLog file contents are read sequentially, merged, and written to a new location. Pointers (values in a key LSM Tree) are updated to point to these new locations. To avoid scanning entire vLog contents, WiscKey uses head and tail pointers, holding information about vLog segments that hold live keys.

Since data in vLog is unsorted and contains no liveness information, the key tree has to be scanned to find which values are still live. Performing these checks during garbage collection introduces additional complexity: traditional LSM Trees can resolve file contents during compaction without addressing the key index.

Concurrency in LSM Trees

The main concurrency challenges in LSM Trees are related to switching table views (collections of memory- and disk-resident tables that change during flush and compaction) and log synchronization. Memtables are also generally accessed concurrently (except in core-partitioned stores such as ScyllaDB), but concurrent in-memory data structures are out of the scope of this book.

During flush, the following rules have to be followed:

  • The new memtable has to become available for reads and writes.

  • The old (flushing) memtable has to remain visible for reads.

  • The flushing memtable has to be written on disk.

  • Discarding the flushed memtable and making the flushed disk-resident table available for reads have to be performed as an atomic operation.

  • The write-ahead log segment, holding log entries of operations applied to the flushed memtable, has to be discarded.

For example, Apache Cassandra solves these problems by using operation order barriers: all operations that were accepted for write will be waited upon prior to the memtable flush. This way the flush process (serving as a consumer) knows which other processes (acting as producers) depend on it.

More generally, we have the following synchronization points:

Memtable switch

After this, all writes go only to the new memtable, making it primary, while the old one is still available for reads.

Flush finalization

Replaces the old memtable with a flushed disk-resident table in the table view.

Write-ahead log truncation

Discards a log segment holding records associated with a flushed memtable.

These operations have severe correctness implications. Continuing writes to the old memtable might result in data loss; for example, if the write is made into a memtable section that was already flushed. Similarly, failing to leave the old memtable available for reads until its disk-resident counterpart is ready will result in incomplete results.
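
The flush rules above can be sketched in a single-threaded model (a hypothetical `LSMView` class; a real engine must make the swaps atomic with respect to concurrent readers):

```python
# Sketch of memtable switch and flush finalization. Reads search tables
# newest to oldest, so the old memtable stays visible during the flush.

class LSMView:
    def __init__(self):
        self.active = {}        # memtable taking writes
        self.flushing = None    # old memtable, still readable during flush
        self.disk_tables = []   # flushed, disk-resident tables

    def write(self, key, value):
        self.active[key] = value

    def read(self, key):
        tables = [self.active]
        if self.flushing is not None:
            tables.append(self.flushing)
        tables.extend(reversed(self.disk_tables))
        for table in tables:    # newest version wins
            if key in table:
                return table[key]
        return None

    def start_flush(self):
        # memtable switch: new writes go to a fresh memtable,
        # while the old one remains visible for reads
        self.flushing, self.active = self.active, {}

    def finalize_flush(self):
        # flush finalization: install the disk-resident table and drop
        # the old memtable in one step (atomic in a real engine)
        self.disk_tables.append(dict(self.flushing))
        self.flushing = None

view = LSMView()
view.write("k", 1)
view.start_flush()            # "k" -> 1 now lives in the flushing memtable
view.write("k", 2)            # goes to the new active memtable
assert view.read("k") == 2    # newest version wins over the flushing one
view.finalize_flush()
assert view.read("k") == 2    # still served, now backed by a disk table
```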

During compaction, the table view is also changed, but here the process is slightly more straightforward: old disk-resident tables are discarded, and the compacted version is added instead. Old tables have to remain accessible for reads until the new one is fully written and is ready to replace them for reads. Situations in which the same tables participate in multiple compactions running in parallel have to be avoided as well.

In B-Trees, log truncation has to be coordinated with flushing dirty pages from the page cache to guarantee durability. In LSM Trees, we have a similar requirement: writes are buffered in a memtable, and their contents are not durable until fully flushed, so log truncation has to be coordinated with memtable flushes. As soon as the flush is complete, the log manager is given the information about the latest flushed log segment, and its contents can be safely discarded.

Not synchronizing log truncations with flushes will also result in data loss: if a log segment is discarded before the flush is complete, and the node crashes, log contents will not be replayed, and data from this segment won’t be restored.
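
This ordering requirement can be illustrated with a small simulation (hypothetical names; a crash is modeled by recovering only from disk tables plus whatever log segments survived):

```python
# Recovery model: durable state = disk-resident tables + replayed WAL.

def recover(disk_tables, log_segments):
    state = {}
    for table in disk_tables:      # apply durable tables first
        state.update(table)
    for segment in log_segments:   # then replay surviving log entries
        state.update(segment)
    return state

memtable = {"k": "v"}              # buffered write, not yet durable
log_segments = [{"k": "v"}]        # WAL segment holding the same operation
disk_tables = []

# Wrong order: discard the segment before the flush completes, then crash.
assert recover(disk_tables, []) == {}                    # "k" is lost

# Right order: flush first, then truncate the log.
disk_tables.append(dict(memtable))
log_segments.clear()               # safe now: the data is durable on disk
assert recover(disk_tables, log_segments) == {"k": "v"}
```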

Log Stacking

Many modern filesystems are log structured: they buffer writes in a memory segment and flush its contents on disk when it becomes full in an append-only manner. SSDs use log-structured storage, too, to deal with small random writes, minimize write overhead, improve wear leveling, and increase device lifetime.

Log-structured storage (LSS) systems started gaining popularity around the time SSDs were becoming more affordable. LSM Trees and SSDs are a good match, since sequential workloads and append-only writes help to reduce amplification from in-place updates, which negatively affect performance on SSDs.

If we stack multiple log-structured systems on top of each other, we can run into several of the problems that we were trying to solve using LSS, including write amplification, fragmentation, and poor performance. At the very least, we need to keep the SSD flash translation layer and the filesystem in mind when developing our applications [YANG14].

Flash Translation Layer

Using a log-structured mapping layer in SSDs is motivated by two factors: small random writes have to be batched together in a physical page, and SSDs work by using program/erase cycles. Writes can be done only into previously erased pages. This means that a page cannot be programmed (in other words, written) unless it is empty (in other words, erased).

A single page cannot be erased, and only groups of pages in a block (typically holding 64 to 512 pages) can be erased together. Figure 7-13 shows a schematic representation of pages, grouped into blocks. The flash translation layer (FTL) translates logical page addresses to their physical locations and keeps track of page states (live, discarded, or empty). When FTL runs out of free pages, it has to perform garbage collection and erase discarded pages.

Figure 7-13. SSD pages, grouped into blocks

There are no guarantees that all pages in the block that is about to be erased are discarded. Before the block can be erased, FTL has to relocate its live pages to one of the blocks containing empty pages. Figure 7-14 shows the process of moving live pages from one block to new locations.
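
A toy flash translation layer makes these mechanics concrete (assumed layout of 3 blocks with 4 pages each; real devices hold 64 to 512 pages per block and persist the mapping):

```python
# Toy FTL: logical-to-physical mapping, page states (live / discarded /
# empty), and block-granularity erase with live-page relocation.

PAGES_PER_BLOCK = 4
BLOCKS = 3

class FTL:
    def __init__(self):
        self.state = [["empty"] * PAGES_PER_BLOCK for _ in range(BLOCKS)]
        self.mapping = {}   # logical page -> (block, page)

    def _free_page(self):
        for b in range(BLOCKS):
            for p in range(PAGES_PER_BLOCK):
                if self.state[b][p] == "empty":
                    return b, p
        raise RuntimeError("no free pages: garbage collection needed")

    def write(self, logical):
        # an overwrite goes to a fresh page; the old page is only discarded
        if logical in self.mapping:
            b, p = self.mapping[logical]
            self.state[b][p] = "discarded"
        b, p = self._free_page()
        self.state[b][p] = "live"
        self.mapping[logical] = (b, p)

    def erase_block(self, block):
        # relocate live pages first; only whole blocks can be erased
        for p in range(PAGES_PER_BLOCK):
            if self.state[block][p] == "live":
                logical = next(l for l, loc in self.mapping.items()
                               if loc == (block, p))
                self.state[block][p] = "discarded"
                nb, npage = self._free_page()
                self.state[nb][npage] = "live"
                self.mapping[logical] = (nb, npage)
        self.state[block] = ["empty"] * PAGES_PER_BLOCK

ftl = FTL()
for logical in range(4):
    ftl.write(logical)        # fills block 0
ftl.write(0)                  # rewrite: old page 0 becomes "discarded"
ftl.erase_block(0)            # relocates live pages 1-3, then erases block 0
assert all(s == "empty" for s in ftl.state[0])
assert len(ftl.mapping) == 4  # all logical pages are still readable
```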

Figure 7-14. Page relocation during garbage collection

When all live pages are relocated, the block can be safely erased, and its empty pages become available for writes. Since FTL is aware of page states and state transitions and has all the necessary information, it is also responsible for SSD wear leveling.

Note

Wear leveling distributes the load evenly across the medium, avoiding hotspots, where blocks fail prematurely because of a high number of program-erase cycles. It is required, since flash memory cells can go through only a limited number of program-erase cycles, and using memory cells evenly helps to extend the lifetime of the device.

In summary, the motivation for using log-structured storage on SSDs is to amortize I/O costs by batching small random writes together, which generally results in a smaller number of operations and, subsequently, reduces the number of times the garbage collection is triggered.

Filesystem Logging

On top of that, we get filesystems, many of which also use logging techniques for write buffering to reduce write amplification and use the underlying hardware optimally.

Log stacking manifests in a few different ways. First, each layer has to perform its own bookkeeping, and most often the underlying log does not expose the information necessary to avoid duplicating the efforts.

Figure 7-15 shows a mapping between a higher-level log (for example, the application) and a lower-level log (for example, the filesystem) resulting in redundant logging and different garbage collection patterns [YANG14]. Misaligned segment writes can make the situation even worse, since discarding a higher-level log segment may cause fragmentation and relocation of the neighboring segments’ parts.

Figure 7-15. Unaligned writes and discarding a higher-level log segment

Because layers do not communicate LSS-related scheduling (for example, discarding or relocating segments), lower-level subsystems might perform redundant operations on discarded data or the data that is about to be discarded. Similarly, because there’s no single, standard segment size, it may happen that unaligned higher-level segments occupy multiple lower-level segments. All these overheads can be reduced or completely avoided.

Even though we say that log-structured storage is all about sequential I/O, we have to keep in mind that database systems may have multiple write streams (for example, log writes parallel to data record writes) [YANG14]. When considered on a hardware level, interleaved sequential write streams may not translate into the same sequential pattern: blocks are not necessarily going to be placed in write order. Figure 7-16 shows multiple streams overlapping in time, writing records that have sizes not aligned with the underlying hardware page size.

Figure 7-16. Unaligned multistream writes

This results in fragmentation that we tried to avoid. To reduce interleaving, some database vendors recommend keeping the log on a separate device to isolate workloads and be able to reason about their performance and access patterns independently. However, it is more important to keep partitions aligned to the underlying hardware [INTEL14] and keep writes aligned to page size [KIM12].

LLAMA and Mindful Stacking

Well, you’ll never believe this, but that llama you’re looking at was once a human being. And not just any human being. That guy was an emperor. A rich, powerful ball of charisma.

Kuzco from The Emperor’s New Groove

“Bw-Trees”中,我们讨论了称为 Bw-Tree 的不可变 B-Tree 版本。Bw-Tree 位于无锁存、日志结构、访问方法感知(LLAMA) 存储子系统之上。这种分层允许 Bw-Tree 动态增长和收缩,同时使垃圾收集和页面管理对树来说是透明的。在这里,我们最感兴趣的是访问方法感知部分,展示了软件层之间协调的好处。

In “Bw-Trees”, we discussed an immutable B-Tree version called Bw-Tree. Bw-Tree is layered on top of a latch-free, log-structured, access-method aware (LLAMA) storage subsystem. This layering allows Bw-Trees to grow and shrink dynamically, while leaving garbage collection and page management transparent for the tree. Here, we’re most interested in the access-method aware part, demonstrating the benefits of coordination between the software layers.

To recap, a logical Bw-Tree node consists of a linked list of physical delta nodes, a chain of updates from the newest one to the oldest one, ending in a base node. Logical nodes are linked using an in-memory mapping table, pointing to the location of the latest update on disk. Keys and values are added to and removed from the logical nodes, but their physical representations remain immutable.

Log-structured storage buffers node updates (delta nodes) together in 4 Mb flush buffers. As soon as the page fills up, it’s flushed on disk. Periodically, garbage collection reclaims space occupied by the unused delta and base nodes, and relocates the live ones to free up fragmented pages.

Without access-method awareness, interleaved delta nodes that belong to different logical nodes will be written in their insertion order. Bw-Tree awareness in LLAMA allows for the consolidation of several delta nodes into a single contiguous physical location. If two updates in delta nodes cancel each other (for example, an insert followed by delete), their logical consolidation can be performed as well, and only the latter delete can be persisted.
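
The consolidation idea can be sketched as follows (a hypothetical encoding, not LLAMA's actual layout: a logical node is a base dict plus a chain of deltas, newest first). When an insert is later followed by a delete of the same key, the pair simply disappears from the rewritten base node:

```python
# Consolidate a Bw-Tree-style delta chain into a single base node.

def consolidate(base, deltas):
    node = dict(base)
    for op, key, *value in reversed(deltas):   # apply oldest -> newest
        if op == "insert":
            node[key] = value[0]
        elif op == "delete":
            node.pop(key, None)
    return node   # one contiguous base node, no delta chain to traverse

base = {"a": 1}
deltas = [("delete", "b"),        # newest delta
          ("insert", "b", 2),
          ("insert", "c", 3)]     # oldest delta

print(consolidate(base, deltas))  # {'a': 1, 'c': 3}: the b insert/delete pair cancels
```

A reader of the consolidated node no longer has to replay three deltas on every access, and the canceled pair never has to be persisted.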

LSS garbage collection can also take care of consolidating the logical Bw-Tree node contents. This means that garbage collection will not only reclaim the free space, but also significantly reduce the physical node fragmentation. If garbage collection only rewrote several delta nodes contiguously, they would still take the same amount of space, and readers would need to perform the work of applying the delta updates to the base node. At the same time, if a higher-level system consolidated the nodes and wrote them contiguously to the new locations, LSS would still have to garbage-collect the old versions.

By being aware of Bw-Tree semantics, several deltas may be rewritten as a single base node with all deltas already applied during garbage collection. This reduces the total space used to represent this Bw-Tree node and the latency required to read the page while reclaiming the space occupied by discarded pages.

You can see that, when considered carefully, stacking can yield many benefits. It is not necessary to always build tightly coupled single-level structures. Good APIs and exposing the right information can significantly improve efficiency.

Open-Channel SSDs

An alternative to stacking software layers is to skip all indirection layers and use the hardware directly. For example, it is possible to avoid using a filesystem and flash translation layer by developing for Open-Channel SSDs. This way, we can avoid at least two layers of logs and have more control over wear-leveling, garbage collection, data placement, and scheduling. One of the implementations that uses this approach is LOCS (LSM Tree-based KV Store on Open-Channel SSD) [ZHANG13]. Another example using Open-Channel SSDs is LightNVM, implemented in the Linux kernel [BJØRLING17].

The flash translation layer usually handles data placement, garbage collection, and page relocation. Open-Channel SSDs expose their internals, drive management, and I/O scheduling without needing to go through the FTL. While this certainly requires much more attention to detail from the developer’s perspective, this approach may yield significant performance improvements. You can draw a parallel with using the O_DIRECT flag to bypass the kernel page cache, which gives better control, but requires manual page management.

Software Defined Flash (SDF) [OUYANG14], a hardware/software codesigned Open-Channel SSD system, exposes an asymmetric I/O interface that takes SSD specifics into consideration. Sizes of read and write units are different, and write unit size corresponds to erase unit size (block), which greatly reduces write amplification. This setting is ideal for log-structured storage, since there’s only one software layer that performs garbage collection and relocates pages. Additionally, developers have access to internal SSD parallelism, since every channel in SDF is exposed as a separate block device, which can be used to further improve performance.

Hiding complexity behind a simple API might sound compelling, but can cause complications in cases in which software layers have different semantics. Exposing some underlying system internals may be beneficial for better integration.

Summary

Log-structured storage is used everywhere: from the flash translation layer, to filesystems and database systems. It helps to reduce write amplification by batching small random writes together in memory. To reclaim space occupied by removed segments, LSS periodically triggers garbage collection.

LSM Trees take some ideas from LSS and help to build index structures managed in a log-structured manner: writes are batched in memory and flushed on disk; shadowed data records are cleaned up during compaction.

It is important to remember that many software layers use LSS, and make sure that layers are stacked optimally. Alternatively, we can skip the filesystem level altogether and access hardware directly.

Part I Conclusion

In Part I, we’ve been talking about storage engines. We started from high-level database system architecture and classification, learned how to implement on-disk storage structures, and how they fit into the full picture with other components.

We’ve seen several storage structures, starting from B-Trees. The discussed structures do not represent an entire field, and there are many other interesting developments. However, these examples are still a good illustration of the three properties we identified at the beginning of this part: buffering, immutability, and ordering. These properties are useful for describing, memorizing, and expressing different aspects of the storage structures.

Figure I-1 summarizes the discussed storage structures and shows whether or not they’re using these properties.

Adding in-memory buffers always has a positive impact on write amplification. In in-place update structures like WiredTiger and LA-Trees, in-memory buffering helps to amortize the cost of multiple same-page writes by combining them. In other words, buffering helps to reduce write amplification.

In immutable structures, such as multicomponent LSM Trees and FD-Trees, buffering has a similar positive effect, but at a cost of future rewrites when moving data from one immutable level to the other. In other words, using immutability may lead to deferred write amplification. At the same time, using immutability has a positive impact on concurrency and space amplification, since most of the discussed immutable structures use fully occupied pages.

When using immutability, unless we also use buffering, we end up with unordered storage structures like Bitcask and WiscKey (with the exception of copy-on-write B-Trees, which copy, re-sort, and relocate their pages). WiscKey stores only keys in sorted LSM Trees and allows retrieving records in key order using the key index. In Bw-Trees, some of the nodes (ones that were consolidated) hold data records in key order, while the rest of the logical Bw-Tree nodes may have their delta updates scattered across different pages.

Figure I-1. Buffering, immutability, and ordering properties of the discussed storage structures. (1) WiscKey uses buffering only for keeping keys in sorted order. (2) Only consolidated nodes in Bw-Trees hold ordered records.

You see that these three properties can be mixed and matched in order to achieve the desired characteristics. Unfortunately, storage engine design usually involves trade-offs: you increase the cost of one operation in favor of the other.

Using this knowledge, you should be able to start looking closer at the code of most modern database systems. Some of the code references and starting points can be found across the entire book. Knowing and understanding the terminology will make this process easier for you.

Many modern database systems are powered by probabilistic data structures [FLAJOLET12] [CORMODE04], and there’s new research being done on bringing ideas from machine learning into database systems [KRASKA18]. We’re about to experience further changes in research and industry as nonvolatile and byte-addressable storage becomes more prevalent and widely available [VENKATARAMAN11].

Knowing the fundamental concepts described in this book should help you to understand and implement newer research, since it borrows from, builds upon, and is inspired by the same concepts. The major advantage of knowing the theory and history is that there’s nothing entirely new and, as the narrative of this book shows, progress is incremental.

Part II. Distributed Systems

A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.

Leslie Lamport

Without distributed systems, we wouldn’t be able to make phone calls, transfer money, or exchange information over long distances. We use distributed systems daily, sometimes without even realizing it: any client/server application is a distributed system.

For many modern software systems, vertical scaling (scaling by running the same software on a bigger, faster machine with more CPU, RAM, or faster disks) isn’t viable. Bigger machines are more expensive, harder to replace, and may require special maintenance. An alternative is to scale horizontally: to run software on multiple machines connected over the network and working as a single logical entity.

Distributed systems might differ both in size, from a handful to hundreds of machines, and in characteristics of their participants, from small handheld or sensor devices to high-performance computers.

The time when database systems were mainly running on a single node is long gone, and most modern database systems have multiple nodes connected in clusters to increase storage capacity, improve performance, and enhance availability.

Even though some of the theoretical breakthroughs in distributed computing aren’t new, most of their practical application happened relatively recently. Today, we see increasing interest in the subject, more research, and new development being done.

Part II. Basic definitions

In a distributed system, we have several participants (sometimes called processes, nodes, or replicas). Each participant has its own local state. Participants communicate by exchanging messages using communication links between them.

Processes can access the time using a clock, which can be logical or physical. Logical clocks are implemented using a kind of monotonically growing counter. Physical clocks, also called wall clocks, are bound to a notion of time in the physical world and are accessible through process-local means; for example, through an operating system.

It’s impossible to talk about distributed systems without mentioning the inherent difficulties caused by the fact that its parts are located apart from each other. Remote processes communicate through links that can be slow and unreliable, which makes knowing the exact state of the remote process more complicated.

Most of the research in the distributed systems field is related to the fact that nothing is entirely reliable: communication channels may delay, reorder, or fail to deliver the messages; processes may pause, slow down, crash, go out of control, or suddenly stop responding.

There are many themes in common in the fields of concurrent and distributed programming, since CPUs are tiny distributed systems with links, processors, and communication protocols. You’ll see many parallels with concurrent programming in “Consistency Models”. However, most of the primitives can’t be reused directly because of the costs of communication between remote parties, and the unreliability of links and processes.

To overcome the difficulties of the distributed environment, we need to use a particular class of algorithms, distributed algorithms, which have notions of local and remote state and execution and work despite unreliable networks and component failures. We describe algorithms in terms of state and steps (or phases), with transitions between them. Each process executes the algorithm steps locally, and a combination of local executions and process interactions constitutes a distributed algorithm.

Distributed algorithms describe the local behavior and interaction of multiple independent nodes. Nodes communicate by sending messages to each other. Algorithms define participant roles, exchanged messages, states, transitions, executed steps, properties of the delivery medium, timing assumptions, failure models, and other characteristics that describe processes and their interactions.

Distributed algorithms serve many different purposes:

Coordination

A process that supervises the actions and behavior of several workers.

Cooperation

Multiple participants relying on one another for finishing their tasks.

Dissemination

Processes cooperating in spreading the information to all interested parties quickly and reliably.

Consensus

Achieving agreement among multiple processes.

In this book, we talk about algorithms in the context of their usage and prefer a practical approach over purely academic material. First, we cover all necessary abstractions, the processes and the connections between them, and progress to building more complex communication patterns. We start with UDP, where the sender has no guarantees on whether its message has reached its destination, and end with consensus, where multiple processes agree on a specific value.

Chapter 8. Introduction and Overview

What makes distributed systems inherently different from single-node systems? Let’s take a look at a simple example and try to see. In a single-threaded program, we define variables and the execution process (a set of steps).

For example, we can define a variable and perform simple arithmetic operations over it:

int i = 1;
i += 2;
i *= 2;

We have a single execution history: we declare a variable, increment it by two, then multiply it by two, and get the result: 6. Let’s say that, instead of having one execution thread performing these operations, we have two threads that have read and write access to variable x.

Concurrent Execution

As soon as two execution threads are allowed to access the variable, the exact outcome of the concurrent step execution is unpredictable, unless the steps are synchronized between the threads. Instead of a single possible outcome, we end up with four, as Figure 8-1 shows.

Figure 8-1. Possible interleavings of concurrent executions
  • a) x = 2, if both threads read an initial value, the adder writes its value, but it is overwritten with the multiplication result.

  • b) x = 3, if both threads read an initial value, the multiplier writes its value, but it is overwritten with the addition result.

  • c) x = 4, if the multiplier can read the initial value and execute its operation before the adder starts.

  • d) x = 6, if the adder can read the initial value and execute its operation before the multiplier starts.
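
Rather than racing real threads, the four outcomes can be enumerated deterministically (a sketch, with each thread modeled as a read step followed by a compute-and-write step, trying every legal interleaving):

```python
# Enumerate all interleavings of two threads: an adder (x + 2) and a
# multiplier (x * 2), each reading the shared value before writing it.

from itertools import permutations

def run(schedule):
    x = {"value": 1}
    local = {}                        # each thread's privately read copy
    for thread, step in schedule:
        if step == "read":
            local[thread] = x["value"]
        elif thread == "adder":
            x["value"] = local[thread] + 2
        else:
            x["value"] = local[thread] * 2
    return x["value"]

steps = [("adder", "read"), ("adder", "write"),
         ("multiplier", "read"), ("multiplier", "write")]

outcomes = set()
for order in permutations(steps):
    # keep only schedules where each thread reads before it writes
    if order.index(("adder", "read")) < order.index(("adder", "write")) and \
       order.index(("multiplier", "read")) < order.index(("multiplier", "write")):
        outcomes.add(run(order))

print(sorted(outcomes))   # [2, 3, 4, 6]
```

The four results match cases a) through d) above; synchronizing the threads would collapse the set to a single outcome.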

Even before we can cross a single node boundary, we encounter the first problem in distributed systems: concurrency. Every concurrent program has some properties of a distributed system. Threads access the shared state, perform some operations locally, and propagate the results back to the shared variables.

To define execution histories precisely and reduce the number of possible outcomes, we need consistency models. Consistency models describe concurrent executions and establish an order in which operations can be executed and made visible to the participants. Using different consistency models, we can constrain or relax the number of states the system can be in.

There is a lot of overlap in terminology and research in the areas of distributed systems and concurrent computing, but there are also some differences. In a concurrent system, we can have shared memory, which processors can use to exchange the information. In a distributed system, each processor has its local state and participants communicate by passing messages.

Shared State in a Distributed System

We can try to introduce some notion of shared memory to a distributed system, for example, a single source of information, such as a database. Even if we solve the problems with concurrent access to it, we still cannot guarantee that all processes are in sync.

To access this database, processes have to go over the communication medium by sending and receiving messages to query or modify the state. However, what happens if one of the processes does not receive a response from the database for a longer time? To answer this question, we first have to define what longer even means. To do this, the system has to be described in terms of synchrony: whether the communication is fully asynchronous, or whether there are some timing assumptions. These timing assumptions allow us to introduce operation timeouts and retries.

We do not know whether the database hasn’t responded because it’s overloaded, unavailable, or slow, or because of some problems with the network on the way to it. This describes the nature of a crash: processes may crash by failing to participate in further algorithm steps, having a temporary failure, or by omitting some of the messages. We need to define a failure model and describe ways in which failures can occur before we decide how to treat them.

A property that describes system reliability and whether or not it can continue operating correctly in the presence of failures is called fault tolerance. Failures are inevitable, so we need to build systems with reliable components, and eliminating a single point of failure in the form of the aforementioned single-node database can be the first step in this direction. We can do this by introducing some redundancy and adding a backup database. However, now we face a different problem: how do we keep multiple copies of shared state in sync?

So far, trying to introduce shared state to our simple system has left us with more questions than answers. We now know that sharing state is not as simple as just introducing a database, and have to take a more granular approach and describe interactions in terms of independent processes and passing messages between them.

Fallacies of Distributed Computing

In an ideal case, when two computers talk over the network, everything works just fine: a process opens up a connection, sends the data, gets responses, and everyone is happy. Assuming that operations always succeed and nothing can go wrong is dangerous, since when something does break and our assumptions turn out to be wrong, systems behave in ways that are hard or impossible to predict.

Most of the time, assuming that the network is reliable is a reasonable thing to do. It has to be reliable to at least some extent to be useful. We’ve all been in the situation when we tried to establish a connection to the remote server and got a Network is Unreachable error instead. But even if it is possible to establish a connection, a successful initial connection to the server does not guarantee that the link is stable, and the connection can get interrupted at any time. The message might’ve reached the remote party, but the response could’ve gotten lost, or the connection was interrupted before the response was delivered.

Network switches break, cables get disconnected, and network configurations can change at any time. We should build our system by handling all of these scenarios gracefully.

A connection can be stable, but we can’t expect remote calls to be as fast as the local ones. We should make as few assumptions about latency as possible and never assume that latency is zero. For our message to reach a remote server, it has to go through several software layers, and a physical medium such as optic fiber or a cable. All of these operations are not instantaneous.

Michael Lewis, in his Flash Boys book (Simon and Schuster), tells a story about companies spending millions of dollars to reduce latency by several milliseconds to be able to access stock exchanges faster than the competition. This is a great example of using latency as a competitive advantage, but it’s worth mentioning that, according to some other studies, such as [BARTLETT16], the chance of stale-quote arbitrage (the ability to profit from being able to know prices and execute orders faster than the competition) doesn’t give fast traders the ability to exploit markets.

Learning our lessons, we’ve added retries, reconnects, and removed the assumptions about instantaneous execution, but this still turns out not to be enough. When increasing the number, rates, and sizes of exchanged messages, or adding new processes to the existing network, we should not assume that bandwidth is infinite.

Note

In 1994, Peter Deutsch published a now-famous list of assertions, titled “Fallacies of distributed computing,” describing the aspects of distributed computing that are easy to overlook. In addition to network reliability, latency, and bandwidth assumptions, he describes some other problems. For example, network security, the possible presence of adversarial parties, intentional and unintentional topology changes that can break our assumptions about presence and location of specific resources, transport costs in terms of both time and resources, and, finally, the existence of a single authority having knowledge and control over the entire network.

Deutsch’s list of distributed computing fallacies is pretty exhaustive, but it focuses on what can go wrong when we send messages from one process to another through the link. These concerns are valid and describe the most general and low-level complications, but unfortunately, there are many other assumptions we make about the distributed systems while designing and implementing them that can cause problems when operating them.

Processing

Before a remote process can send a response to the message it just received, it needs to perform some work locally, so we cannot assume that processing is instantaneous. Taking network latency into consideration is not enough, as operations performed by the remote processes aren’t immediate, either.

Moreover, there’s no guarantee that processing starts as soon as the message is delivered. The message may land in the pending queue on the remote server, and will have to wait there until all the messages that arrived before it are processed.

Nodes can be located closer or further from one another, have different CPUs, amounts of RAM, different disks, or be running different software versions and configurations. We cannot expect them to process requests at the same rate. If we have to wait for several remote servers working in parallel to respond to complete the task, the execution as a whole is as slow as the slowest remote server.

Contrary to the widespread belief, queue capacity is not infinite and piling up more requests won’t do the system any good. Backpressure is a strategy that allows us to cope with producers that publish messages at a rate that is faster than the rate at which consumers can process them by slowing down the producers. Backpressure is one of the least appreciated and applied concepts in distributed systems, often built post hoc instead of being an integral part of the system design.
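
A minimal way to express backpressure is a bounded queue whose rejection signal tells producers to slow down. The following sketch (an illustrative class built on `java.util.concurrent`, not a prescription) shows the idea:

```java
import java.util.concurrent.ArrayBlockingQueue;

// A bounded buffer between producer and consumer: when the queue is full,
// offer() returns false instead of letting requests pile up without bound,
// and the producer is expected to back off before trying again.
public class BoundedInbox {
    private final ArrayBlockingQueue<String> queue;

    public BoundedInbox(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    // Producer side: returns false when the consumer is falling behind.
    public boolean tryPublish(String message) {
        return queue.offer(message);
    }

    // Consumer side: returns null when there is nothing to process.
    public String poll() {
        return queue.poll();
    }
}
```

A producer that receives false from tryPublish should slow down or retry with a delay, rather than looping tightly; that propagation of "slow down" signals back toward the source is the essence of backpressure.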

Even though increasing the queue capacity might sound like a good idea and can help to pipeline, parallelize, and effectively schedule requests, nothing is happening to the messages while they’re sitting in the queue and waiting for their turn. Increasing the queue size may negatively impact latency, since changing it has no effect on the processing rate.

In general, process-local queues are used to achieve the following goals:

Decoupling

Receipt and processing are separated in time and happen independently.

Pipelining

Requests in different stages are processed by independent parts of the system. The subsystem responsible for receiving messages doesn’t have to block until the previous message is fully processed.

Absorbing short-time bursts

System load tends to vary, but request inter-arrival times are hidden from the component responsible for request processing. Overall system latency increases because of the time spent in the queue, but this is usually still better than responding with a failure and retrying the request.

Queue size is workload- and application-specific. For relatively stable workloads, we can size queues by measuring task processing times and the average time each task spends in the queue before it is processed, and making sure that latency remains within acceptable bounds while throughput increases. In this case, queue sizes are relatively small. For unpredictable workloads, when tasks get submitted in bursts, queues should be sized to account for bursts and high load as well.

The remote server can work through requests quickly, but it doesn’t mean that we always get a positive response from it. It can respond with a failure: it couldn’t make a write, the searched value was not present, or it could’ve hit a bug. In summary, even the most favorable scenario still requires some attention from our side.

Clocks and Time

Time is an illusion. Lunchtime doubly so.

Ford Prefect, The Hitchhiker’s Guide to the Galaxy

Assuming that clocks on remote machines run in sync can also be dangerous. Combined with latency is zero and processing is instantaneous, it leads to different idiosyncrasies, especially in time-series and real-time data processing. For example, when collecting and aggregating data from participants with a different perception of time, you should understand time drifts between them and normalize times accordingly, rather than relying on the source timestamp. Unless you use specialized high-precision time sources, you should not rely on timestamps for synchronization or ordering. Of course this doesn’t mean we cannot or should not rely on time at all: in the end, any synchronous system uses local clocks for timeouts.

It’s essential to always account for the possible time differences between the processes and the time required for the messages to get delivered and processed. For example, Spanner (see “Distributed Transactions with Spanner”) uses a special time API that returns a timestamp and uncertainty bounds to impose a strict transaction order. Some failure-detection algorithms rely on a shared notion of time and a guarantee that the clock drift is always within allowed bounds for correctness [GUPTA01].

Besides the fact that clock synchronization in a distributed system is hard, the current time is constantly changing: you can request a current POSIX timestamp from the operating system, and request another current timestamp after executing several steps, and the two will be different. This is a rather obvious observation, but understanding both a source of time and which exact moment the timestamp captures is crucial.

Understanding whether the clock source is monotonic (i.e., that it won’t ever go backward) and how much the scheduled time-related operations might drift can be helpful, too.
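
In Java, for instance, this distinction surfaces as two clock sources: System.currentTimeMillis() reads the wall clock, which can be stepped backward (e.g., by NTP adjustments), while System.nanoTime() is monotonic and meaningful only for measuring intervals. A small sketch:

```java
public class Stopwatch {
    // Measures elapsed time with the monotonic clock. Using the wall clock
    // here could yield a negative duration if the clock is stepped back
    // between the two readings.
    public static long elapsedNanos(Runnable task) {
        long start = System.nanoTime(); // monotonic: never goes backward
        task.run();
        return System.nanoTime() - start;
    }
}
```

The returned nanosecond count is only useful as a difference; a single nanoTime() reading has no meaning as an absolute point in time.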

State Consistency

Most of the previous assumptions fall into the almost always false category, but there are some that are better described as not always true: when it’s easy to take a mental shortcut and simplify the model by thinking of it a specific way, ignoring some tricky edge cases.

Distributed algorithms do not always guarantee strict state consistency. Some approaches have looser constraints and allow state divergence between replicas, and rely on conflict resolution (an ability to detect and resolve diverged states within the system) and read-time data repair (bringing replicas back in sync during reads in cases where they respond with different results). You can find more information about these concepts in Chapter 12. Assuming that the state is fully consistent across the nodes may lead to subtle bugs.

An eventually consistent distributed database system might have the logic to handle replica disagreement by querying a quorum of nodes during reads, but assume that the database schema and the view of the cluster are strongly consistent. Unless we enforce consistency of this information, relying on that assumption may have severe consequences.

For example, there was a bug in Apache Cassandra, caused by the fact that schema changes propagate to servers at different times. If you tried to read from the database while the schema was propagating, there was a chance of corruption, since one server encoded results assuming one schema and the other one decoded them using a different schema.

Another example is a bug caused by the divergent view of the ring: if one of the nodes assumes that the other node holds data records for a key, but this other node has a different view of the cluster, reading or writing the data can result in misplacing data records or getting an empty response while data records are in fact happily present on the other node.

It is better to think about the possible problems in advance, even if a complete solution is costly to implement. By understanding and handling these cases, you can embed safeguards or change the design in a way that makes the solution more natural.

Local and Remote Execution

Hiding complexity behind an API might be dangerous. For example, if you have an iterator over the local dataset, you can reasonably predict what’s going on behind the scenes, even if the storage engine is unfamiliar. Understanding the process of iteration over the remote dataset is an entirely different problem: you need to understand consistency and delivery semantics, data reconciliation, paging, merges, concurrent access implications, and many other things.

Simply hiding both behind the same interface, however useful, might be misleading. Additional API parameters may be necessary for debugging, configuration, and observability. We should always keep in mind that local and remote execution are not the same [WALDO96].

The most apparent problem with hiding remote calls is latency: remote invocation is many times more costly than the local one, since it involves two-way network transport, serialization/deserialization, and many other steps. Interleaving local and blocking remote calls may lead to performance degradation and unintended side effects [VINOSKI08].

Need to Handle Failures

It’s OK to start working on a system assuming that all nodes are up and functioning normally, but thinking this is the case all the time is dangerous. In a long-running system, nodes can be taken down for maintenance (which usually involves a graceful shutdown) or crash for various reasons: software problems, out-of-memory killer [KERRISK10], runtime bugs, hardware issues, etc. Processes do fail, and the best thing you can do is be prepared for failures and understand how to handle them.

If the remote server doesn’t respond, we do not always know the exact reason for it. It could be caused by the crash, a network failure, the remote process, or the link to it being slow. Some distributed algorithms use heartbeat protocols and failure detectors to form a hypothesis about which participants are alive and reachable.
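
A timeout-based failure detector can be sketched as follows (an illustrative class; time is passed in explicitly to keep the logic deterministic). Note that it can only suspect a peer: over asynchronous links, a slow process is indistinguishable from a crashed one.

```java
import java.util.HashMap;
import java.util.Map;

// Tracks the last heartbeat received from each peer and suspects peers
// whose heartbeats are overdue.
public class HeartbeatFailureDetector {
    private final Map<String, Long> lastHeartbeatMillis = new HashMap<>();
    private final long timeoutMillis;

    public HeartbeatFailureDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    public void onHeartbeat(String peer, long nowMillis) {
        lastHeartbeatMillis.put(peer, nowMillis);
    }

    // A suspicion is a hypothesis, not a fact: the peer may merely be slow,
    // or the link to it may be lossy.
    public boolean isSuspected(String peer, long nowMillis) {
        Long last = lastHeartbeatMillis.get(peer);
        return last == null || nowMillis - last > timeoutMillis;
    }
}
```

Real failure detectors refine this idea (for example, by adapting the timeout to observed heartbeat intervals), but the structure is the same: form a hypothesis about liveness from the messages that did arrive.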

Network Partitions and Partial Failures

When two or more servers cannot communicate with each other, we call the situation a network partition. In “Perspectives on the CAP Theorem” [GILBERT12], Seth Gilbert and Nancy Lynch draw a distinction between the case when two participants cannot communicate with each other and the case when several groups of participants are isolated from one another, cannot exchange messages, and proceed with the algorithm.

General unreliability of the network (packet loss, retransmission, latencies that are hard to predict) is annoying but tolerable, while network partitions can cause much more trouble, since independent groups can proceed with execution and produce conflicting results. Network links can also fail asymmetrically: messages can still be getting delivered from one process to the other one, but not vice versa.

To build a system that is robust in the presence of failure of one or multiple processes, we have to consider cases of partial failures [TANENBAUM06] and how the system can continue operating even though a part of it is unavailable or functioning incorrectly.

Failures are hard to detect and aren’t always visible in the same way from different parts of the system. When designing highly available systems, one should always think about edge cases: what if we did replicate the data, but received no acknowledgments? Do we need to retry? Is the data still going to be available for reads on the nodes that have sent acknowledgments?

Murphy’s Law2 tells us that the failures do happen. Programming folklore adds that the failures will happen in the worst way possible, so our job as distributed systems engineers is to make sure we reduce the number of scenarios where things go wrong and prepare for failures in a way that contains the damage they can cause.

It’s impossible to prevent all failures, but we can still build a resilient system that functions correctly in their presence. The best way to design for failures is to test for them. It’s close to impossible to think through every possible failure scenario and predict the behaviors of multiple processes. Setting up testing harnesses that create partitions, simulate bit rot [GRAY05], increase latencies, diverge clocks, and magnify relative processing speeds is the best way to go about it. Real-world distributed system setups can be quite adversarial, unfriendly, and “creative” (however, in a very hostile way), so the testing effort should attempt to cover as many scenarios as possible.

Tip

Over the last few years, we’ve seen a few open source projects that help to recreate different failure scenarios. Toxiproxy can help to simulate network problems: limit the bandwidth, introduce latency, timeouts, and more. Chaos Monkey takes a more radical approach and exposes engineers to production failures by randomly shutting down services. CharybdeFS helps to simulate filesystem and hardware errors and failures. You can use these tools to test your software and make sure it behaves correctly in the presence of these failures. CrashMonkey, a filesystem agnostic record-replay-and-test framework, helps test data and metadata consistency for persistent files.

When working with distributed systems, we have to take fault tolerance, resilience, possible failure scenarios, and edge cases seriously. Similar to “given enough eyeballs, all bugs are shallow,” we can say that a large enough cluster will eventually hit every possible issue. At the same time, given enough testing, we will be able to eventually find every existing problem.

Cascading Failures

We cannot always wholly isolate failures: a process tipping over under a high load increases the load for the rest of the cluster, making it even more probable for the other nodes to fail. Cascading failures can propagate from one part of the system to the other, increasing the scope of the problem.

Sometimes, cascading failures can even be initiated by perfectly good intentions. For example, a node was offline for a while and did not receive the most recent updates. After it comes back online, helpful peers would like to help it to catch up with recent happenings and start streaming the data it’s missing over to it, exhausting network resources or causing the node to fail shortly after the startup.

Tip

To protect a system from propagating failures and treat failure scenarios gracefully, circuit breakers can be used. In electrical engineering, circuit breakers protect expensive and hard-to-replace parts from overload or short circuit by interrupting the current flow. In software development, circuit breakers monitor failures and allow fallback mechanisms that can protect the system by steering away from the failing service, giving it some time to recover, and handling failing calls gracefully.
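
A minimal circuit breaker can be sketched as follows (an illustrative class, not a production implementation): after a number of consecutive failures it opens and fails fast, and once a cooldown has passed it lets a probe request through again. Time is injected to keep it deterministic.

```java
// Trips after `threshold` consecutive failures; while open, callers should
// fail fast or use a fallback instead of calling the failing service.
public class CircuitBreaker {
    private int consecutiveFailures = 0;
    private long openedAtMillis = -1; // -1 means the breaker is closed
    private final int threshold;
    private final long cooldownMillis;

    public CircuitBreaker(int threshold, long cooldownMillis) {
        this.threshold = threshold;
        this.cooldownMillis = cooldownMillis;
    }

    private boolean isOpen(long nowMillis) {
        return openedAtMillis >= 0 && nowMillis - openedAtMillis < cooldownMillis;
    }

    // Should the call be attempted, or rejected immediately?
    public boolean allowRequest(long nowMillis) {
        return !isOpen(nowMillis);
    }

    public void recordSuccess() {
        consecutiveFailures = 0;
        openedAtMillis = -1;
    }

    public void recordFailure(long nowMillis) {
        consecutiveFailures++;
        if (consecutiveFailures >= threshold) {
            openedAtMillis = nowMillis; // trip the breaker
        }
    }
}
```

Fuller implementations add an explicit half-open state, in which only a limited number of probe requests are allowed through before the breaker decides whether to close again.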

When the connection to one of the servers fails or the server does not respond, the client starts a reconnection loop. By that point, an overloaded server already has a hard time catching up with new connection requests, and client-side retries in a tight loop don’t help the situation. To avoid that, we can use a backoff strategy. Instead of retrying immediately, clients wait for some time. Backoff can help us to avoid amplifying problems by scheduling retries and increasing the time window between subsequent requests.

Backoff is used to increase time periods between requests from a single client. However, different clients using the same backoff strategy can produce substantial load as well. To prevent different clients from retrying all at once after the backoff period, we can introduce jitter. Jitter adds small random time periods to backoff and reduces the probability of clients waking up and retrying at the same time.
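
A common way to combine the two is exponential backoff with "full jitter": the delay doubles with each attempt up to a cap, and the actual wait is drawn uniformly between zero and that delay, so clients retrying after the same failure spread out. A sketch (the names are ours):

```java
import java.util.Random;

public class RetryBackoff {
    // Returns how long to wait before the given retry attempt.
    // The cap bounds the worst-case delay; the random draw is the jitter.
    public static long delayMillis(int attempt, long baseMillis,
                                   long capMillis, Random random) {
        long exponential = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        return (long) (random.nextDouble() * exponential);
    }
}
```

Without the jitter, all clients that failed at the same moment would wake up and retry at the same moment, too, recreating the original load spike.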

Hardware failures, bit rot, and software errors can result in corruption that can propagate through standard delivery mechanisms. For example, corrupted data records can get replicated to the other nodes if they are not validated. Without validation mechanisms in place, a system can propagate corrupted data to the other nodes, potentially overwriting noncorrupted data records. To avoid that, we should use checksumming and validation to verify the integrity of any content exchanged between the nodes.
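
A checksum such as CRC32 (non-cryptographic, but sufficient for detecting accidental corruption) can be attached to each record and re-verified on receipt; a sketch using the JDK's java.util.zip.CRC32:

```java
import java.util.zip.CRC32;

// Attaches a CRC32 checksum to a payload before sending, and verifies it
// on receipt: a mismatch means the record was corrupted in transit or at
// rest and must not be applied or propagated further.
public class ChecksummedRecord {
    public static long checksum(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return crc.getValue();
    }

    public static boolean verify(byte[] payload, long expectedChecksum) {
        return checksum(payload) == expectedChecksum;
    }
}
```

CRC32 guards only against accidental corruption; if adversarial tampering is in the threat model, a cryptographic hash or MAC is needed instead.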

Overload and hotspotting can be avoided by planning and coordinating execution. Instead of letting peers execute operation steps independently, we can use a coordinator that prepares an execution plan based on the available resources and predicts the load based on the past execution data available to it.

In summary, we should always consider cases in which failures in one part of the system can cause problems elsewhere. We should equip our systems with circuit breakers, backoff, validation, and coordination mechanisms. Handling small isolated problems is always more straightforward than trying to recover from a large outage.

We’ve just spent an entire section discussing problems and potential failure scenarios in distributed systems, but we should see this as a warning and not as something that should scare us away.

Understanding what can go wrong, and carefully designing and testing our systems makes them more robust and resilient. Being aware of these issues can help you to identify and find potential sources of problems during development, as well as debug them in production.

Distributed Systems Abstractions

When talking about programming languages, we use common terminology and define our programs in terms of functions, operators, classes, variables, and pointers. Having a common vocabulary helps us to avoid inventing new words every time we describe anything. The more precise and less ambiguous our definitions are, the easier it is for our listeners to understand us.

Before we move to algorithms, we first have to cover the distributed systems vocabulary: definitions you’ll frequently encounter in talks, books, and papers.

Links

Networks are not reliable: messages can get lost, delayed, and reordered. Now, with this thought in our minds, we will try to build several communication protocols. We’ll start with the least reliable and robust ones, identifying the states they can be in, and figuring out the possible additions to the protocol that can provide better guarantees.

Fair-loss link

We can start with two processes, connected with a link. Processes can send messages to each other, as shown in Figure 8-2. Any communication medium is imperfect, and messages can get lost or delayed.

Let’s see what kind of guarantees we can get. After the message M is sent, from the sender’s perspective, it can be in one of the following states:

  • Not yet delivered to process B (but will be, at some point in time)

  • Irrecoverably lost during transport

  • Successfully delivered to the remote process

Notice that the sender does not have any way to find out if the message is already delivered. In distributed systems terminology, this kind of link is called fair-loss. The properties of this kind of link are:

Fair loss

If both sender and recipient are correct and the sender keeps retransmitting the message infinitely many times, it will eventually be delivered.3

Finite duplication

Sent messages won’t be delivered infinitely many times.

No creation

A link will not come up with messages; in other words, it won’t deliver the message that was never sent.

A fair-loss link is a useful abstraction and a first building block for communication protocols with strong guarantees. We can assume that this link is not losing messages between communicating parties systematically and doesn’t create new messages. But, at the same time, we cannot entirely rely on it. This might remind you of the User Datagram Protocol (UDP), which allows us to send messages from one process to the other, but does not have reliable delivery semantics on the protocol level.
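
A fair-loss link is easy to model. The sketch below is purely illustrative (the `FairLossLink` class and its methods are invented for this example, not taken from any real networking library): the link may silently drop a message, never fabricates one, and retransmitting enough times eventually gets a message through.

```python
import random

class FairLossLink:
    """Toy model of a fair-loss link: may drop messages, never creates them."""

    def __init__(self, drop_probability, rng=None):
        self.drop_probability = drop_probability
        self.rng = rng or random.Random()
        self.delivered = []  # messages that reached the remote process

    def send(self, message):
        """One transmission attempt. In a real network the sender cannot
        observe this return value; it is exposed here only for the model."""
        if self.rng.random() < self.drop_probability:
            return False  # irrecoverably lost during transport
        self.delivered.append(message)
        return True

    def send_with_retries(self, message, attempts):
        """Fair loss: if we keep retransmitting, delivery eventually succeeds."""
        return any(self.send(message) for _ in range(attempts))
```

With a drop probability below 1, the chance that every retransmit is lost shrinks exponentially with the number of attempts, which is exactly the fair-loss property: keep sending, and the message is eventually delivered.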

Message acknowledgments

To improve the situation and get more clarity in terms of message status, we can introduce acknowledgments: a way for the recipient to notify the sender that it has received the message. For that, we need to use bidirectional communication channels and add some means that allow us to distinguish differences between the messages; for example, sequence numbers, which are unique monotonically increasing message identifiers.

Note

It is enough to have a unique identifier for every message. Sequence numbers are just a particular case of a unique identifier, where we achieve uniqueness by drawing identifiers from a counter. When using hash algorithms to identify messages uniquely, we should account for possible collisions and make sure we can still disambiguate messages.

Now, process A can send a message M(n), where n is a monotonically increasing message counter. As soon as B receives the message, it sends an acknowledgment ACK(n) back to A. Figure 8-3 shows this form of communication.

The acknowledgment, as well as the original message, may get lost on the way. The number of states the message can be in changes slightly. Until A receives an acknowledgment, the message is still in one of the three states we mentioned previously, but as soon as A receives the acknowledgment, it can be confident that the message is delivered to B.
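
This bookkeeping can be sketched in a few lines (the `Sender` class below is hypothetical, for illustration only): sequence numbers are drawn from a counter, and a message leaves the "unknown" state only when its acknowledgment arrives.

```python
class Sender:
    """Tracks message states from the sender's perspective."""

    def __init__(self):
        self.next_seq = 0
        self.in_flight = {}        # seq -> message; lost, delayed, or delivered
        self.acknowledged = set()  # seqs known to have been delivered

    def send(self, message):
        """Assign the next monotonically increasing sequence number."""
        n = self.next_seq
        self.next_seq += 1
        self.in_flight[n] = message
        return n

    def on_ack(self, n):
        """ACK(n) is the only way to learn that M(n) was delivered."""
        if n in self.in_flight:
            self.acknowledged.add(n)
            del self.in_flight[n]
```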

Message retransmits

Adding acknowledgments is still not enough to call this communication protocol reliable: a sent message may still get lost, or the remote process may fail before acknowledging it. To solve this problem and provide delivery guarantees, we can try retransmits instead. Retransmits are a way for the sender to retry a potentially failed operation. We say potentially failed, because the sender doesn’t really know whether it has failed or not, since the type of link we’re about to discuss does not use acknowledgments.

After process A sends message M, it waits until timeout T is triggered and tries to send the same message again. Assuming the link between processes stays intact, network partitions between the processes are not infinite, and not all packets are lost, we can state that, from the sender’s perspective, the message is either not yet delivered to process B or is successfully delivered to process B. Since A keeps trying to send the message, we can say that it cannot get irrecoverably lost during transport.

In distributed systems terminology, this abstraction is called a stubborn link. It’s called stubborn because the sender keeps resending the message again and again indefinitely, but, since this sort of abstraction would be highly impractical, we need to combine retries with acknowledgments.
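
Combined with acknowledgments, the retry loop looks roughly like this (a sketch only; a true stubborn link would retry forever, so we bound the number of attempts to keep the example practical):

```python
def send_until_acked(transmit, max_attempts):
    """Retransmit until the recipient acknowledges the message.

    `transmit` models one send attempt over an unreliable link and
    returns True iff an acknowledgment made it back to the sender.
    """
    for attempt in range(1, max_attempts + 1):
        if transmit():
            return attempt  # how many tries the delivery took
    raise TimeoutError(f"no acknowledgment after {max_attempts} attempts")
```

For example, if the first two transmissions are lost, the third attempt succeeds and the loop stops; without a bound, the sender would keep retrying stubbornly.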

Problem with retransmits

Whenever we send the message, until we receive an acknowledgment from the remote process, we do not know whether it has already been processed, it will be processed shortly, it has been lost, or the remote process has crashed before receiving it—any one of these states is possible. We can retry the operation and send the message again, but this can result in message duplicates. Processing duplicates is only safe if the operation we’re about to perform is idempotent.

An idempotent operation is one that can be executed multiple times, yielding the same result without producing additional side effects. For example, a server shutdown operation can be idempotent, the first call initiates the shutdown, and all subsequent calls do not produce any additional effects.
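
The difference can be shown with two toy operations (both classes are invented for this example): repeating an idempotent shutdown is harmless, while repeating a non-idempotent charge is not.

```python
class Server:
    """shutdown() is idempotent: extra calls change nothing."""

    def __init__(self):
        self.running = True

    def shutdown(self):
        # The first call initiates the shutdown; every later call
        # leaves the state exactly as it was.
        self.running = False

class BillingService:
    """charge() is not idempotent: every retransmit charges again."""

    def __init__(self):
        self.total_charged = 0

    def charge(self, amount):
        self.total_charged += amount
```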

If every operation was idempotent, we could think less about delivery semantics, rely more on retransmits for fault tolerance, and build systems in an entirely reactive way: triggering an action as a response to some signal, without causing unintended side effects. However, operations are not necessarily idempotent, and merely assuming that they are might lead to cluster-wide side effects. For example, charging a customer’s credit card is not idempotent, and charging it multiple times is definitely undesirable.

Idempotence is particularly important in the presence of partial failures and network partitions, since we cannot always find out the exact status of a remote operation—whether it has succeeded, failed, or will be executed shortly—and we just have to wait longer. Since guaranteeing that each executed operation is idempotent is an unrealistic requirement, we need to provide guarantees equivalent to idempotence without changing the underlying operation semantics. To achieve this, we can use deduplication and avoid processing messages more than once.

Message order

Unreliable networks present us with two problems: messages can arrive out of order and, because of retransmits, some messages may arrive more than once. We have already introduced sequence numbers, and we can use these message identifiers on the recipient side to ensure first-in, first-out (FIFO) ordering. Since every message has a sequence number, the receiver can track:

  • n_consecutive, specifying the highest sequence number up to which it has seen all messages. Messages up to this number can be put back in order.

  • n_processed, specifying the highest sequence number up to which messages were put back in their original order and processed. This number can be used for deduplication.

If the received message has a nonconsecutive sequence number, the receiver puts it into the reordering buffer. For example, it receives a message with a sequence number 5 after receiving one with 3, and we know that 4 is still missing, so we need to put 5 aside until 4 comes, and we can reconstruct the message order. Since we’re building on top of a fair-loss link, we assume that messages between n_consecutive and n_max_seen will eventually be delivered.

The recipient can safely discard the messages with sequence numbers up to n_consecutive that it receives, since they’re guaranteed to be already delivered.

Deduplication works by checking if the message with a sequence number n has already been processed (passed down the stack by the receiver) and discarding already processed messages.
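
The two counters and the reordering buffer fit together as follows (a simplified sketch; the `Receiver` class is illustrative): handing messages to the application strictly in sequence order restores FIFO ordering and deduplicates at the same time.

```python
class Receiver:
    """Reorders and deduplicates messages by their sequence numbers."""

    def __init__(self):
        self.n_consecutive = 0  # next expected sequence number
        self.buffer = {}        # out-of-order messages: seq -> payload
        self.processed = []     # messages handed to the application, in order

    def receive(self, seq, payload):
        if seq < self.n_consecutive or seq in self.buffer:
            return  # duplicate caused by a retransmit: discard it
        self.buffer[seq] = payload
        # drain the buffer while the sequence has no gaps
        while self.n_consecutive in self.buffer:
            self.processed.append(self.buffer.pop(self.n_consecutive))
            self.n_consecutive += 1
```

Replaying the example from the text: after 3 arrives, 5 is set aside until 4 closes the gap, and a retransmitted duplicate is dropped because its sequence number is below n_consecutive.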

In distributed systems terms, this type of link is called a perfect link, which provides the following guarantees [CACHIN11]:

Reliable delivery

Every message sent once by the correct process A to the correct process B, will eventually be delivered.

No duplication

No message is delivered more than once.

No creation

Same as with other types of links, it can only deliver the messages that were actually sent.

This might remind you of the TCP4 protocol (however, reliable delivery in TCP is guaranteed only in the scope of a single session). Of course, this model is just a simplified representation we use for illustration purposes only. TCP has a much more sophisticated model for dealing with acknowledgments, which groups acknowledgments and reduces the protocol-level overhead. In addition, TCP has selective acknowledgments, flow control, congestion control, error detection, and many other features that are out of the scope of our discussion.

Exactly-once delivery

There are only two hard problems in distributed systems: 2. Exactly-once delivery 1. Guaranteed order of messages 2. Exactly-once delivery.

Mathias Verraes

There have been many discussions about whether or not exactly-once delivery is possible. Here, semantics and precise wording are essential. Since there might be a link failure preventing the message from being delivered from the first try, most of the real-world systems employ at-least-once delivery, which ensures that the sender retries until it receives an acknowledgment, otherwise the message is not considered to be received. Another delivery semantic is at-most-once: the sender sends the message and doesn’t expect any delivery confirmation.

The TCP protocol works by breaking down messages into packets, transmitting them one by one, and stitching them back together on the receiving side. TCP might attempt to retransmit some of the packets, and more than one transmission attempt may succeed. Since TCP marks each packet with a sequence number, even though some packets were transmitted more than once, it can deduplicate the packets and guarantee that the recipient will see the message and process it only once. In TCP, this guarantee is valid only for a single session: if the message is acknowledged and processed, but the sender didn’t receive the acknowledgment before the connection was interrupted, the application is not aware of this delivery and, depending on its logic, it might attempt to send the message once again.

This means that exactly-once processing is what’s interesting here since duplicate deliveries (or packet transmissions) have no side effects and are merely an artifact of the best effort by the link. For example, if the database node has only received the record, but hasn’t persisted it, delivery has occurred, but it’ll be of no use unless the record can be retrieved (in other words, unless it was both delivered and processed).

For the exactly-once guarantee to hold, nodes should have a common knowledge [HALPERN90]: everyone knows about some fact, and everyone knows that everyone else also knows about that fact. In simplified terms, nodes have to agree on the state of the record: both nodes agree that it either was or was not persisted. As you will see later in this chapter, this is theoretically impossible, but in practice we still use this notion by relaxing coordination requirements.

Any misunderstanding about whether or not exactly-once delivery is possible most likely comes from approaching the problem from different protocol and abstraction levels and the definition of “delivery.” It’s not possible to build a reliable link without ever transferring any message more than once, but we can create the illusion of exactly-once delivery from the sender’s perspective by processing the message once and ignoring duplicates.

Now, as we have established the means for reliable communication, we can move ahead and look for ways to achieve uniformity and agreement between processes in the distributed system.

Two Generals’ Problem

One of the most prominent descriptions of an agreement in a distributed system is a thought experiment widely known as the Two Generals’ Problem.

This thought experiment shows that it is impossible to achieve an agreement between two parties if communication is asynchronous in the presence of link failures. Even though TCP exhibits properties of a perfect link, it’s important to remember that perfect links, despite the name, do not guarantee perfect delivery. They also can’t guarantee that participants will be alive the whole time, and are concerned only with transport.

Imagine two armies, led by two generals, preparing to attack a fortified city. The armies are located on two sides of the city and can succeed in their siege only if they attack simultaneously.

The generals can communicate by sending messengers, and already have devised an attack plan. The only thing they now have to agree on is whether or not to carry out the plan. Variants of this problem are when one of the generals has a higher rank, but needs to make sure the attack is coordinated; or that the generals need to agree on the exact time. These details do not change the problem definition: the generals have to come to an agreement.

The army generals only have to agree on the fact that they both will proceed with the attack. Otherwise, the attack cannot succeed. General A sends a message MSG(N), stating an intention to proceed with the attack at a specified time, if the other party agrees to proceed as well.

After A sends the messenger, he doesn’t know whether the messenger has arrived or not: the messenger can get captured and fail to deliver the message. When general B receives the message, he has to send an acknowledgment ACK(MSG(N)). Figure 8-4 shows that a message is sent one way and acknowledged by the other party.

Figure 8-4. Two Generals’ Problem illustrated

The messenger carrying this acknowledgment might get captured or fail to deliver it, as well. B doesn’t have any way of knowing if the messenger has successfully delivered the acknowledgment.

To be sure about it, B has to wait for ACK(ACK(MSG(N))), a second-order acknowledgment stating that A received an acknowledgment for the acknowledgment.

No matter how many further confirmations the generals send to each other, they will always be one ACK away from knowing if they can safely proceed with the attack. The generals are doomed to wonder if the message carrying this last acknowledgment has reached the destination.
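
The regress can be made concrete with a tiny simulation (purely illustrative). Messages alternate between the generals, and receiving message i confirms that message i - 1 arrived; whatever the length of the exchange, the sender of the last message is left with an unconfirmed one.

```python
def run_exchange(delivered):
    """Simulate `delivered` successfully delivered messages, alternating
    A -> B (odd-numbered messages) and B -> A (even-numbered ones).

    Returns (a_certain, b_certain): whether each general knows that their
    own latest message reached the other side.
    """
    a_last, b_last = 0, 0            # latest message each general sent
    a_confirmed, b_confirmed = 0, 0  # highest own message known delivered
    for i in range(1, delivered + 1):
        if i % 2 == 1:  # A sends message i; on receipt, B learns that
            a_last = i  # its previous message (i - 1) arrived
            b_confirmed = i - 1
        else:           # B sends; A learns its message (i - 1) arrived
            b_last = i
            a_confirmed = i - 1
    return a_last == a_confirmed, b_last == b_confirmed
```

For every exchange length, exactly one general still has an outstanding, unconfirmed message: the parties are always one ACK away from certainty.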

Notice that we did not make any timing assumptions: communication between generals is fully asynchronous. There is no upper time bound set on how long the generals can take to respond.

FLP Impossibility

In a paper by Fischer, Lynch, and Paterson, the authors describe a problem famously known as the FLP Impossibility Problem [FISCHER85] (derived from the first letters of authors’ last names), wherein they discuss a form of consensus in which processes start with an initial value and attempt to agree on a new value. After the algorithm completes, this new value has to be the same for all nonfaulty processes.

Reaching an agreement on a specific value is straightforward if the network is entirely reliable; but in reality, systems are prone to many different sorts of failures, such as message loss, duplication, network partitions, and slow or crashed processes.

A consensus protocol describes a system that, given multiple processes starting at its initial state, brings all of the processes to the decision state. For a consensus protocol to be correct, it has to preserve three properties:

Agreement

The decision the protocol arrives at has to be unanimous: each process decides on some value, and this has to be the same for all processes. Otherwise, we have not reached a consensus.

Validity

The agreed value has to be proposed by one of the participants, which means that the system should not just “come up” with the value. This also implies nontriviality of the value: processes should not always decide on some predefined default value.

Termination

An agreement is final only if there are no processes that did not reach the decision state.

[FISCHER85] assumes that processing is entirely asynchronous; there’s no shared notion of time between the processes. Algorithms in such systems cannot be based on timeouts, and there’s no way for a process to find out whether the other process has crashed or is simply running too slow. The paper shows that, given these assumptions, there exists no protocol that can guarantee consensus in a bounded time. No completely asynchronous consensus algorithm can tolerate the unannounced crash of even a single remote process.

If we do not consider an upper time bound for the process to complete the algorithm steps, process failures can’t be reliably detected, and there’s no deterministic algorithm to reach a consensus.

However, FLP Impossibility does not mean we have to pack our things and go home, as reaching consensus is not possible. It only means that we cannot always reach consensus in an asynchronous system in bounded time. In practice, systems exhibit at least some degree of synchrony, and the solution to this problem requires a more refined model.

System Synchrony

FLP Impossibility,可以看到时序假设是分布式系统的关键特性之一。在异步系统中,我们不知道进程的相对速度,并且不能保证消息在有限的时间内或特定的顺序传递。该流程可能需要无限长时间才能响应,并且无法始终可靠地检测到流程故障。

From FLP Impossibility, you can see that the timing assumption is one of the critical characteristics of the distributed system. In an asynchronous system, we do not know the relative speeds of processes, and cannot guarantee message delivery in a bounded time or a particular order. The process might take indefinitely long to respond, and process failures can’t always be reliably detected.

The main criticism of asynchronous systems is that these assumptions are not realistic: processes can’t have arbitrarily different processing speeds, and links don’t take indefinitely long to deliver messages. Relying on time both simplifies reasoning and helps to provide upper-bound timing guarantees.

It is not always possible to solve a consensus problem in an asynchronous model [FISCHER85]. Moreover, designing an efficient synchronous algorithm is not always achievable, and for some tasks the practical solutions are more likely to be time-dependent [ARJOMANDI83].

These assumptions can be loosened up, and the system can be considered to be synchronous. For that, we introduce the notion of timing. It is much easier to reason about the system under the synchronous model. It assumes that processes are progressing at comparable rates, that transmission delays are bounded, and message delivery cannot take arbitrarily long.

A synchronous system can also be represented in terms of synchronized process-local clocks: there is an upper time bound in time difference between the two process-local time sources [CACHIN11].

Designing systems under a synchronous model allows us to use timeouts. We can build more complex abstractions, such as leader election, consensus, failure detection, and many others on top of them. This makes the best-case scenarios more robust, but results in a failure if the timing assumptions don’t hold up. For example, in the Raft consensus algorithm (see “Raft”), we may end up with multiple processes believing they’re leaders, which is resolved by forcing the lagging process to accept the other process as a leader; failure-detection algorithms (see Chapter 9) can wrongly identify a live process as failed or vice versa. When designing our systems, we should make sure to consider these possibilities.
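
A timeout-based failure detector, which only makes sense under these timing assumptions, can be sketched as follows (the class and its names are illustrative; real-world detectors are considerably more elaborate):

```python
class TimeoutFailureDetector:
    """Suspects a process if no heartbeat arrived within `timeout` time units.

    Correct only while the synchrony assumption holds: a live but slow
    process whose heartbeats exceed the timeout is wrongly suspected.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = {}  # process id -> time of the last heartbeat

    def heartbeat(self, pid, now):
        self.last_heartbeat[pid] = now

    def suspected(self, pid, now):
        last = self.last_heartbeat.get(pid)
        return last is None or now - last > self.timeout
```

Note that the detector cannot distinguish a crashed process from a slow one: a late heartbeat turns a "failed" process back into a live one, which is exactly the false suspicion discussed above.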

Properties of both asynchronous and synchronous models can be combined, and we can think of a system as partially synchronous. A partially synchronous system exhibits some of the properties of the synchronous system, but the bounds of message delivery, clock drift, and relative processing speeds might not be exact and hold only most of the time [DWORK88].

Synchrony is an essential property of the distributed system: it has an impact on performance, scalability, and general solvability, and has many factors necessary for the correct functioning of our systems. Some of the algorithms we discuss in this book operate under the assumptions of synchronous systems.

Failure Models

We keep mentioning failures, but so far it has been a rather broad and generic concept that might capture many meanings. Similar to how we can make different timing assumptions, we can assume the presence of different types of failures. A failure model describes exactly how processes can crash in a distributed system, and algorithms are developed using these assumptions. For example, we can assume that a process can crash and never recover, or that it is expected to recover after some time passes, or that it can fail by spinning out of control and supplying incorrect values.

In distributed systems, processes rely on one another for executing an algorithm, so failures can result in incorrect execution across the whole system.

We’ll discuss multiple failure models present in distributed systems, such as crash, omission, and arbitrary faults. This list is not exhaustive, but it covers most of the cases applicable and important in real-life systems.

Crash Faults

Normally, we expect the process to be executing all steps of an algorithm correctly. The simplest way for a process to crash is by stopping the execution of any further steps required by the algorithm and not sending any messages to other processes. In other words, the process crashes. Most of the time, we assume a crash-stop process abstraction, which prescribes that, once the process has crashed, it remains in this state.

This model does not assume that it is impossible for the process to recover, and does not discourage recovery or try to prevent it. It only means that the algorithm does not rely on recovery for correctness or liveness. Nothing prevents processes from recovering, catching up with the system state, and participating in the next instance of the algorithm.

Failed processes are not able to continue participating in the current round of negotiations during which they failed. Assigning the recovering process a new, different identity does not make the model equivalent to crash-recovery (discussed next), since most algorithms use predefined lists of processes and clearly define failure semantics in terms of how many failures they can tolerate [CACHIN11].

Crash-recovery is a different process abstraction, under which the process stops executing the steps required by the algorithm, but recovers at a later point and tries to execute further steps. The possibility of recovery requires introducing a durable state and recovery protocol into the system [SKEEN83]. Algorithms that allow crash-recovery need to take all possible recovery states into consideration, since the recovering process may attempt to continue execution from the last step known to it.

Algorithms, aiming to exploit recovery, have to take both state and identity into account. Crash-recovery, in this case, can also be viewed as a special case of omission failure, since from the other process’s perspective there’s no distinction between the process that was unreachable and the one that has crashed and recovered.

Omission Faults

Another failure model is omission fault. This model assumes that the process skips some of the algorithm steps, or is not able to execute them, or this execution is not visible to other participants, or it cannot send or receive messages to and from other participants. Omission fault captures network partitions between the processes caused by faulty network links, switch failures, or network congestion. Network partitions can be represented as omissions of messages between individual processes or process groups. A crash can be simulated by completely omitting any messages to and from the process.
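
Both observations can be captured by modeling the network as a link with a pluggable drop predicate (a sketch; the `LossyLink` class is invented for this illustration): a partition omits messages between groups, and a crash, seen from outside, is total omission.

```python
class LossyLink:
    """Models omission faults: a predicate decides which messages are dropped."""

    def __init__(self, drop):
        self.drop = drop   # drop(sender, receiver, message) -> bool
        self.inboxes = {}  # receiver -> list of (sender, message)

    def send(self, sender, receiver, message):
        if self.drop(sender, receiver, message):
            return  # omission fault: the message silently disappears
        self.inboxes.setdefault(receiver, []).append((sender, message))

# A partition between groups "a*" and "b*": cross-group messages are omitted.
partitioned = LossyLink(drop=lambda s, r, m: s[0] != r[0])

# A crashed process "c", seen from the outside, is just total omission.
crashed = LossyLink(drop=lambda s, r, m: "c" in (s, r))
```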

When the process is operating slower than the other participants and sends responses much later than expected, for the rest of the system it may look like it is forgetful. Instead of stopping completely, a slow node attempts to send its results out of sync with other nodes.

Omission failures occur when the algorithm that was supposed to execute certain steps either skips them or the results of this execution are not visible. For example, this may happen if the message is lost on the way to the recipient, and the sender fails to send it again and continues to operate as if it was successfully delivered, even though it was irrecoverably lost. Omission failures can also be caused by intermittent hangs, overloaded networks, full queues, etc.

Arbitrary Faults

The hardest class of failures to overcome is arbitrary or Byzantine faults: a process continues executing the algorithm steps, but in a way that contradicts the algorithm (for example, if a process in a consensus algorithm decides on a value that no other participant has ever proposed).

Such failures can happen due to bugs in software, or due to processes running different versions of the algorithm, in which case failures are easier to find and understand. It can get much more difficult when we do not have control over all processes, and one of the processes is intentionally misleading other processes.

You might have heard of Byzantine fault tolerance from the aerospace industry: airplane and spacecraft systems do not take responses from subcomponents at face value and cross-validate their results. Another widespread application is cryptocurrencies [GILAD17], where there is no central authority, different parties control the nodes, and adversary participants have a material incentive to forge values and attempt to game the system by providing faulty responses.

Handling Failures

We can mask failures by forming process groups and introducing redundancy into the algorithm: even if one of the processes fails, the user will not notice this failure [CHRISTIAN91].

可能会出现一些与故障相关的性能损失:正常执行依赖于进程的响应,并且系统必须回退到较慢的执行路径以进行错误处理和纠正。通过代码审查、广泛的测试、通过引入超时和重试来确保消息传递以及确保步骤在本地按顺序执行,可以在软件级别上防止许多故障。

There might be some performance penalty related to failures: normal execution relies on processes being responsive, and the system has to fall back to the slower execution path for error handling and correction. Many failures can be prevented on the software level by code reviews, extensive testing, ensuring message delivery by introducing timeouts and retries, and making sure that steps are executed in order locally.

我们将在这里介绍的大多数算法都假设崩溃故障模型并通过引入冗余来解决故障。这些假设有助于创建性能更好、更容易理解和实现的算法。

Most of the algorithms we’re going to cover here assume the crash-failure model and work around failures by introducing redundancy. These assumptions help to create algorithms that perform better and are easier to understand and implement.

概括

Summary

在本章中,我们讨论了一些分布式系统术语并介绍了一些基本概念。我们已经讨论过由于系统组件的不可靠性而导致的固有困难和复杂性:链路可能无法传递消息,进程可能崩溃,或者网络可能分区。

In this chapter, we discussed some of the distributed systems terminology and introduced some basic concepts. We’ve talked about the inherent difficulties and complications caused by the unreliability of the system components: links may fail to deliver messages, processes may crash, or the network may get partitioned.

这个术语应该足以让我们继续讨论。本书的其余部分讨论了分布式系统中常用的解决方案:我们回想一下可能会出错的地方,并看看我们有哪些可用的选项。

This terminology should be enough for us to continue the discussion. The rest of the book talks about the solutions commonly used in distributed systems: we think back to what can go wrong and see what options we have available.

1为了简洁起见,交错(乘法器在加法器之前读取)被省略,因为它产生与 a) 相同的结果。

1 Interleaving, where the multiplier reads before the adder, is left out for brevity, since it yields the same result as a).

2墨菲定律是一句格言,可以概括为“任何可能出错的事情就一定会出错”,这句话很流行,经常被用作流行文化中的习语。

2 Murphy’s Law is an adage that can be summarized as “Anything that can go wrong, will go wrong,” which was popularized and is often used as an idiom in popular culture.

3更精确的定义是,如果正确的进程 A 无限频繁地向正确的进程 B 发送消息,则该消息将被无限频繁地传递([CACHIN11])。

3 A more precise definition is that if a correct process A sends a message to a correct process B infinitely often, it will be delivered infinitely often ([CACHIN11]).

4请参阅https://databass.dev/links/53

4 See https://databass.dev/links/53.

第 9 章故障检测

Chapter 9. Failure Detection

如果一棵树倒在森林里,周围没有人听到,它会发出声音吗?

作者未知

If a tree falls in a forest and no one is around to hear it, does it make a sound?

Unknown Author

为了使系统能够对故障做出适当的反应,应该及时检测到故障。故障进程即使无法响应,也可能仍会被联系,从而增加延迟并降低整体系统的可用性。

In order for a system to appropriately react to failures, failures should be detected in a timely manner. A faulty process might get contacted even though it won’t be able to respond, increasing latencies and reducing overall system availability.

检测异步分布式系统中的故障(即,不做任何计时假设)极其困难,因为无法判断进程是否已崩溃,或者运行缓慢并需要无限长的时间来响应。我们在“FLP Impossibility”中讨论了与此相关的一个问题。

Detecting failures in asynchronous distributed systems (i.e., without making any timing assumptions) is extremely difficult as it’s impossible to tell whether the process has crashed, or is running slowly and taking an indefinitely long time to respond. We discussed a problem related to this one in “FLP Impossibility”.

诸如 dead(死亡)、failed(失败)和 crashed(崩溃)之类的术语通常用于描述完全停止执行其步骤的进程。诸如 unresponsive(无响应)、faulty(故障)和 slow(缓慢)之类的术语用于描述可疑的进程,这些进程实际上可能已经死亡。

Terms such as dead, failed, and crashed are usually used to describe a process that has stopped executing its steps completely. Terms such as unresponsive, faulty, and slow are used to describe suspected processes, which may actually be dead.

失败可能发生在链路级别(进程之间的消息丢失或传递缓慢)或进程级别(进程崩溃或运行缓慢),而缓慢并不总是能与失败区分开来。这意味着在错误地将活着的进程怀疑为死亡(产生误报)与延迟将无响应的进程标记为死亡、给予其怀疑的好处并期望它最终响应(产生漏报)之间,总是需要权衡。

Failures may occur on the link level (messages between processes are lost or delivered slowly), or on the process level (the process crashes or is running slowly), and slowness may not always be distinguishable from failure. This means there’s always a trade-off between wrongly suspecting alive processes as dead (producing false-positives), and delaying marking an unresponsive process as dead, giving it the benefit of doubt and expecting it to respond eventually (producing false-negatives).

故障检测器是本地子系统,负责识别失败或无法访问的进程,将其从算法中排除,并在保证安全性的同时保证活性。

A failure detector is a local subsystem responsible for identifying failed or unreachable processes to exclude them from the algorithm and guarantee liveness while preserving safety.

活力与安全是描述算法解决特定问题的能力及其输出的正确性的属性。更正式地说,活性是保证特定预期事件必须发生的属性。例如,如果其中一个进程失败,则故障检测器必须检测到该故障。安全保证不会发生意外事件。例如,如果故障检测器将某个进程标记为死亡,则该进程实际上必须是死亡的[LAMPORT77] [RAYNAL99] [FREILING11]

Liveness and safety are the properties that describe an algorithm’s ability to solve a specific problem and the correctness of its output. More formally, liveness is a property that guarantees that a specific intended event must occur. For example, if one of the processes has failed, a failure detector must detect that failure. Safety guarantees that unintended events will not occur. For example, if a failure detector has marked a process as dead, this process had to be, in fact, dead [LAMPORT77] [RAYNAL99] [FREILING11].

从实际角度来看,排除失败的进程有助于避免不必要的工作并防止错误传播和级联故障,同时在排除可能可疑的活动进程时降低可用性。

From a practical perspective, excluding failed processes helps to avoid unnecessary work and prevents error propagation and cascading failures, while reducing availability when excluding potentially suspected alive processes.

故障检测算法应该表现出几个基本属性。首先,每个非故障成员最终都应该注意到进程失败,并且算法应该能够取得进展并最终得出最终结果。这一属性称为完整性。

Failure-detection algorithms should exhibit several essential properties. First of all, every nonfaulty member should eventually notice the process failure, and the algorithm should be able to make progress and eventually reach its final result. This property is called completeness.

我们可以通过效率来判断算法的质量:故障检测器识别过程故障的速度有多快。另一种方法是查看算法的准确性:是否精确检测到过程故障。换句话说,如果算法错误地指责实时进程失败或无法检测现有的故障,则该算法是不准确的。

We can judge the quality of the algorithm by its efficiency: how fast the failure detector can identify process failures. Another way to do this is to look at the accuracy of the algorithm: whether or not the process failure was precisely detected. In other words, an algorithm is not accurate if it falsely accuses a live process of being failed or is not able to detect the existing failures.

我们可以将效率和准确性之间的关系视为一个可调参数:更高效的算法可能不太精确,而更准确的算法通常效率较低。事实证明,构建一个既准确又高效的故障检测器是不可能的。同时,故障检测器被允许产生误报(即错误地将活着的进程识别为失败,反之亦然)[CHANDRA96]。

We can think of the relationship between efficiency and accuracy as a tunable parameter: a more efficient algorithm might be less precise, and a more accurate algorithm is usually less efficient. It is provably impossible to build a failure detector that is both accurate and efficient. At the same time, failure detectors are allowed to produce false-positives (i.e., falsely identify live processes as failed and vice versa) [CHANDRA96].

故障检测器是许多共识和原子广播算法的基本先决条件和不可或缺的一部分,我们将在本书后面讨论。

Failure detectors are an essential prerequisite and an integral part of many consensus and atomic broadcast algorithms, which we’ll be discussing later in this book.

许多分布式系统通过使用心跳来实现故障检测器。这种方法因其简单且完备性强而颇受欢迎。我们在这里讨论的算法假设不存在拜占庭式故障:进程不会试图故意隐瞒其状态或邻居的状态。

Many distributed systems implement failure detectors by using heartbeats. This approach is quite popular because of its simplicity and strong completeness. Algorithms we discuss here assume the absence of Byzantine failures: processes do not attempt to intentionally lie about their state or states of their neighbors.

心跳和 Ping

Heartbeats and Pings

我们可以通过触发两个周期性进程之一来查询远程进程的状态

We can query the state of remote processes by triggering one of two periodic processes:

  • 我们可以触发 ping,向远程进程发送消息,通过期望在指定时间段内得到响应来检查它们是否仍然存在。

  • We can trigger a ping, which sends messages to remote processes, checking if they are still alive by expecting a response within a specified time period.

  • 当进程通过向其对等方发送消息主动通知其仍在运行时,我们可以触发心跳。

  • We can trigger a heartbeat when the process is actively notifying its peers that it’s still running by sending messages to them.

我们将在这里使用 ping 作为示例,但可以使用心跳来解决相同的问题,产生类似的结果。

We’ll use pings as an example here, but the same problem can be solved using heartbeats, producing similar results.

每个进程都维护一个其他进程(活动进程、死进程和可疑进程)的列表,并用每个进程的最后响应时间更新它。如果进程长时间未能响应 ping 消息,则会被标记为可疑

Each process maintains a list of other processes (alive, dead, and suspected ones) and updates it with the last response time for each process. If a process fails to respond to a ping message for a longer time, it is marked as suspected.
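
这种"ping + 超时"检测器可以用几行 Python 勾勒出来(这不是书中的代码;类名、超时值和可注入的时钟都是为演示而假设的):

This "ping plus timeout" detector can be sketched in a few lines of Python (not code from the book; the class name, timeout value, and injectable clock are assumptions made for demonstration):

```python
import time

class PingFailureDetector:
    """Minimal deadline-style detector: track the last response time per peer."""

    def __init__(self, peers, timeout=5.0, clock=time.monotonic):
        self.timeout = timeout  # silence longer than this => suspected
        self.clock = clock
        self.last_response = {p: clock() for p in peers}

    def on_ack(self, peer):
        # The peer answered a ping: record when we last heard from it.
        self.last_response[peer] = self.clock()

    def suspected(self):
        # Peers silent for longer than the timeout are marked as suspected.
        now = self.clock()
        return {p for p, t in self.last_response.items() if now - t > self.timeout}
```

正如正文所指出的,这种方法的精度完全取决于超时值的选择。

As the text notes, the precision of this approach depends entirely on the choice of the timeout value.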

图 9-1 显示了系统的正常运行:进程 P1 正在查询相邻节点 P2 的状态,P2 以确认消息进行响应。

Figure 9-1 shows the normal functioning of a system: process P1 is querying the state of neighboring node P2, which responds with an acknowledgment.

图 9-1。用于故障检测的 Ping:正常运行,无消息延迟

相反,图 9-2显示了确认消息如何延迟,这可能导致将活动进程标记为关闭。

In contrast, Figure 9-2 shows how acknowledgment messages are delayed, which might result in marking the active process as down.

图 9-2。用于故障检测的 Ping:响应延迟,在发送下一条消息后出现

许多故障检测算法都是基于心跳和超时。例如,Akka(一种用于构建分布式系统的流行框架)有一个截止时间故障检测器的实现,它使用心跳并在固定时间间隔内未能注册时报告进程故障。

Many failure-detection algorithms are based on heartbeats and timeouts. For example, Akka, a popular framework for building distributed systems, has an implementation of a deadline failure detector, which uses heartbeats and reports a process failure if it has failed to register within a fixed time interval.

这种方法有几个潜在的缺点:它的精度依赖于仔细选择 ping 频率和超时,并且它不能从其他进程的角度捕获进程可见性(请参阅“外包心跳”)。

This approach has several potential downsides: its precision relies on the careful selection of ping frequency and timeout, and it does not capture process visibility from the perspective of other processes (see “Outsourced Heartbeats”).

无超时故障检测器

Timeout-Free Failure Detector

一些算法避免依赖超时来检测故障。例如,Heartbeat,一种无超时故障检测器[AGUILERA97],是一种仅对心跳进行计数的算法,并允许应用程序根据心跳计数器向量中的数据来检测进程故障。由于该算法是无超时的,因此它在异步系统假设下运行。

Some algorithms avoid relying on timeouts for detecting failures. For example, Heartbeat, a timeout-free failure detector [AGUILERA97], is an algorithm that only counts heartbeats and allows the application to detect process failures based on the data in the heartbeat counter vectors. Since this algorithm is timeout-free, it operates under asynchronous system assumptions.

算法假设任何两个正确的进程都通过公平路径相互连接,该路径仅包含公平链接(即,如果一条消息通过该链接无限频繁地发送,那么它也会无限频繁地接收),并且每个进程都知道网络中所有其他进程的存在。

The algorithm assumes that any two correct processes are connected to each other with a fair path, which contains only fair links (i.e., if a message is sent over this link infinitely often, it is also received infinitely often), and each process is aware of the existence of all other processes in the network.

每个进程都维护一个邻居列表和与其关联的计数器。进程首先向其邻居发送心跳消息。每条消息都包含心跳到目前为止所经过的路径。初始消息包含路径中的第一个发送者和可用于避免多次广播同一消息的唯一标识符。

Each process maintains a list of neighbors and counters associated with them. Processes start by sending heartbeat messages to their neighbors. Each message contains a path that the heartbeat has traveled so far. The initial message contains the first sender in the path and a unique identifier that can be used to avoid broadcasting the same message multiple times.

当进程收到新的心跳消息时,它会增加路径中存在的所有参与者的计数器,并将心跳发送到不存在的参与者,将自身附加到路径中。一旦进程发现所有已知进程都已收到消息(换句话说,进程 ID 出现在路径中),就会停止传播消息。

When the process receives a new heartbeat message, it increments counters for all participants present in the path and sends the heartbeat to the ones that are not present there, appending itself to the path. Processes stop propagating messages as soon as they see that all the known processes have already received it (in other words, process IDs appear in the path).
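
单个节点上的这种接收-转发步骤可以大致勾勒如下(这并非 [AGUILERA97] 中的伪代码;消息去重等细节被省略,名称均为示意):

A node's receive-and-forward step can be roughly sketched as follows (this is not the pseudocode from [AGUILERA97]; details such as message deduplication are omitted, and all names are illustrative):

```python
class HeartbeatNode:
    """Sketch of path-based, timeout-free heartbeat counting."""

    def __init__(self, pid, all_pids):
        self.pid = pid
        # One counter per known peer; relatively higher counts suggest liveness.
        self.counters = {p: 0 for p in all_pids if p != pid}

    def on_heartbeat(self, path):
        # Increment counters for every process the heartbeat has traveled through.
        for p in path:
            if p in self.counters:
                self.counters[p] += 1
        # Forward to processes not yet on the path, appending ourselves to it.
        new_path = path + [self.pid]
        targets = [p for p in self.counters if p not in path]
        return new_path, targets
```

注意这里没有任何超时:解释计数器(选择阈值)留给了应用程序,正如正文接下来讨论的那样。

Note that there are no timeouts here: interpreting the counters (picking a threshold) is left to the application, as the text discusses next.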

由于消息通过不同的进程传播,并且心跳路径包含从邻居接收到的聚合信息,因此即使两个进程之间的直接链路出现故障,我们也可以(正确地)将不可达的进程标记为活动状态。

Since messages are propagated through different processes, and heartbeat paths contain aggregated information received from the neighbors, we can (correctly) mark an unreachable process as alive even when the direct link between the two processes is faulty.

心跳计数器代表系统的全局和标准化视图。该视图捕获了心跳如何相对于彼此传播,使我们能够比较进程。然而,这种方法的缺点之一是解释心跳计数器可能非常棘手:我们需要选择一个可以产生可靠结果的阈值。除非我们能做到这一点,否则算法会将活动进程错误地标记为可疑进程。

Heartbeat counters represent a global and normalized view of the system. This view captures how the heartbeats are propagated relative to one another, allowing us to compare processes. However, one of the shortcomings of this approach is that interpreting heartbeat counters may be quite tricky: we need to pick a threshold that can yield reliable results. Unless we can do that, the algorithm will falsely mark active processes as suspected.

外包心跳

Outsourced Heartbeats

可扩展弱一致性感染式进程组成员协议(Scalable Weakly Consistent Infection-style Process Group Membership Protocol,SWIM)[GUPTA01] 使用的一种替代方法是外包心跳,即从邻居的角度利用有关进程活跃性的信息来提高可靠性。这种方法不要求进程知道网络中的所有其他进程,只需要知道相连对等点的一个子集。

An alternative approach, used by the Scalable Weakly Consistent Infection-style Process Group Membership Protocol (SWIM) [GUPTA01] is to use outsourced heartbeats to improve reliability using information about the process liveness from the perspective of its neighbors. This approach does not require processes to be aware of all other processes in the network, only a subset of connected peers.

如图 9-3 所示,进程 P1 向进程 P2 发送 ping 消息。P2 没有响应该消息,因此 P1 继续选择多个随机成员(P3 和 P4)。这些随机成员尝试向 P2 发送心跳消息,如果 P2 响应,则将确认转发回 P1。

As shown in Figure 9-3, process P1 sends a ping message to process P2. P2 doesn’t respond to the message, so P1 proceeds by selecting multiple random members (P3 and P4). These random members try sending heartbeat messages to P2 and, if it responds, forward acknowledgments back to P1.

图 9-3。“外包”心跳

这允许同时考虑直接和间接可达性。例如,如果我们有进程 P1、P2 和 P3,我们可以从 P1 和 P2 两者的角度检查 P3 的状态。

This allows accounting for both direct and indirect reachability. For example, if we have processes P1, P2, and P3, we can check the state of P3 from the perspective of both P1 and P2.

外包心跳通过在成员组之间分配决策责任来实现可靠的故障检测。这种方法不需要向广泛的对等组广播消息。由于外包心跳请求可以并行触发,因此这种方法可以快速收集更多有关可疑进程的信息,并使我们能够做出更准确的决策。

Outsourced heartbeats allow reliable failure detection by distributing responsibility for deciding across the group of members. This approach does not require broadcasting messages to a broad group of peers. Since outsourced heartbeat requests can be triggered in parallel, this approach can collect more information about suspected processes quickly, and allow us to make more accurate decisions.
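
一次 SWIM 风格的探测可以勾勒为:先直接 ping,失败后再请 k 个随机成员代为探测(草图;回调 ping 和 indirect_ping 是为演示假设的接口,并非该协议的规范 API):

A SWIM-style probe can be sketched as a direct ping followed, on failure, by asking k random members to probe on our behalf (a sketch; the ping and indirect_ping callbacks are interfaces assumed for demonstration, not the protocol's canonical API):

```python
import random

def probe(target, members, ping, indirect_ping, k=3):
    """Return True if target is reachable directly or via k random helpers.

    ping(target) -> bool          : direct round-trip attempt
    indirect_ping(m, target)      : ask member m to ping target for us
    """
    if ping(target):
        return True  # reachable directly
    helpers = random.sample([m for m in members if m != target], k)
    # Suspect the target only if no helper can reach it either.
    return any(indirect_ping(m, target) for m in helpers)
```

由于对 k 个帮助者的请求可以并行发出,一次失败的直接 ping 不会立即将目标标记为可疑。

Since the requests to the k helpers can be issued in parallel, a single failed direct ping does not immediately mark the target as suspected.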

Phi-Accrual 故障检测器

Phi-Accrual Failure Detector

phi-应计(φ-accrual)故障检测器 [HAYASHIBARA04] 不是将节点故障视为二元问题(进程只能处于正常或故障两种状态之一),而是采用连续的尺度,捕获受监控进程崩溃的概率。它的工作原理是维护一个滑动窗口,收集来自对等进程的最新心跳的到达时间。该信息用于估计下一次心跳的到达时间,将该近似值与实际到达时间进行比较,并计算怀疑级别 φ:在给定当前网络条件的情况下,故障检测器对故障的确定程度。

Instead of treating node failure as a binary problem, where the process can be only in two states: up or down, a phi-accrual (φ-accrual) failure detector [HAYASHIBARA04] has a continuous scale, capturing the probability of the monitored process’s crash. It works by maintaining a sliding window, collecting arrival times of the most recent heartbeats from the peer processes. This information is used to approximate arrival time of the next heartbeat, compare this approximation with the actual arrival time, and compute the suspicion level φ: how certain the failure detector is about the failure, given the current network conditions.

该算法的工作原理是收集和采样到达时间,创建一个可用于对节点健康状况做出可靠判断的视图。它使用这些样本来计算 的值φ:如果该值达到阈值,则该节点被标记为关闭。该故障检测器通过调整可将节点标记为可疑节点的范围来动态适应不断变化的网络条件。

The algorithm works by collecting and sampling arrival times, creating a view that can be used to make a reliable judgment about node health. It uses these samples to compute the value of φ: if this value reaches a threshold, the node is marked as down. This failure detector dynamically adapts to changing network conditions by adjusting the scale on which the node can be marked as a suspect.

从架构的角度来看,phi-Accrual 故障检测器可以被视为三个子系统的组合:

From the architecture perspective, a phi-accrual failure detector can be viewed as a combination of three subsystems:

监控
Monitoring

通过 ping、心跳或请求响应采样收集活动信息。

Collecting liveness information through pings, heartbeats, or request-response sampling.

解释
Interpretation

决定是否应将进程标记为可疑。

Making a decision on whether or not the process should be marked as suspected.

行动
Action

每当进程被标记为可疑时执行的回调。

A callback executed whenever the process is marked as suspected.

监视过程在固定大小的心跳到达时间窗口中收集并存储数据样本(假设遵循正态分布)。较新到达的心跳数据点将被添加到窗口中,而最旧的心跳数据点将被丢弃。

The monitoring process collects and stores data samples (which are assumed to follow a normal distribution) in a fixed-size window of heartbeat arrival times. Newer arrivals are added to the window, and the oldest heartbeat data points are discarded.

通过确定样本的均值和方差,从采样窗口估计分布参数。该信息用于计算消息在前一条消息之后 t 个时间单位内到达的概率。有了这些信息,我们就可以计算 φ,它描述了我们对进程活跃度做出正确决策的可能性。换句话说,犯错误并收到与计算假设相矛盾的心跳的可能性有多大。

Distribution parameters are estimated from the sampling window by determining the mean and variance of samples. This information is used to compute the probability of arrival of the message within t time units after the previous one. Given this information, we compute φ, which describes how likely we are to make a correct decision about a process’s liveness. In other words, how likely it is to make a mistake and receive a heartbeat that will contradict the calculated assumptions.
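
在正态分布假设下,φ 的计算可以勾勒如下(这是一个简化草图,并非 [HAYASHIBARA04] 的完整算法;窗口采样和平滑等细节被省略):

Under the normal-distribution assumption, the computation of φ can be sketched as follows (a simplified sketch, not the full algorithm from [HAYASHIBARA04]; details such as window sampling and smoothing are omitted):

```python
import math

def phi(time_since_last, intervals):
    """Suspicion level: -log10 of the probability that a heartbeat
    still arrives after `time_since_last`, given the sampled
    inter-arrival `intervals` (assumed normally distributed)."""
    mean = sum(intervals) / len(intervals)
    var = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    std = math.sqrt(var) or 1e-9  # guard against a zero-variance window
    # Normal tail probability of an arrival later than time_since_last.
    z = (time_since_last - mean) / (std * math.sqrt(2))
    p_later = 1.0 - 0.5 * (1.0 + math.erf(z))
    return -math.log10(max(p_later, 1e-30))
```

心跳按时到达时 φ 保持较小;沉默时间越长,φ 增长得越快,应用程序据此选择自己的阈值。

While heartbeats arrive on time, φ stays small; the longer the silence, the faster φ grows, and the application picks its own threshold accordingly.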

这种方法是由日本先进科学技术研究所的研究人员开发的,现在被用于许多分布式系统中;例如,CassandraAkka(以及前面提到的截止日期故障检测器)。

This approach was developed by researchers from the Japan Advanced Institute of Science and Technology, and is now used in many distributed systems; for example, Cassandra and Akka (along with the aforementioned deadline failure detector).

八卦和故障检测

Gossip and Failure Detection

另一种避免依赖单节点视图做出决策的方法是八卦式故障检测服务 [VANRENESSE98],它使用八卦(参见“八卦传播”)来收集和分发相邻进程的状态。

Another approach that avoids relying on a single-node view to make a decision is a gossip-style failure detection service [VANRENESSE98], which uses gossip (see “Gossip Dissemination”) to collect and distribute states of neighboring processes.

每个成员维护其他成员、其心跳计数器和时间戳的列表,指定心跳计数器最后一次递增的时间。每个成员定期增加其心跳计数器并将其列表分发给随机邻居。收到消息后,相邻节点将列表与自己的列表合并,更新其他邻居的心跳计数器。

Each member maintains a list of other members, their heartbeat counters, and timestamps, specifying when the heartbeat counter was incremented for the last time. Periodically, each member increments its heartbeat counter and distributes its list to a random neighbor. Upon the message receipt, the neighboring node merges the list with its own, updating heartbeat counters for the other neighbors.

节点还定期检查状态列表和心跳计数器。如果任何节点在足够长的时间内没有更新其计数器,则被视为失败。应仔细选择该超时期限,以尽量减少误报的可能性。成员之间必须相互通信的频率(换句话说,最坏情况下的带宽)是有上限的,并且最多可以随着系统中进程的数量线性增长。

Nodes also periodically check the list of states and heartbeat counters. If any node did not update its counter for long enough, it is considered failed. This timeout period should be chosen carefully to minimize the probability of false-positives. How often members have to communicate with each other (in other words, worst-case bandwidth) is capped, and can grow at most linearly with a number of processes in the system.
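
核心的“合并八卦表 + 超时判定”逻辑可以勾勒为(草图;member -> (counter, timestamp) 的表结构是为演示假设的):

The core "merge gossiped tables, then apply a timeout" logic can be sketched as (a sketch; the member -> (counter, timestamp) table layout is assumed for demonstration):

```python
def merge(local, remote):
    """Merge a gossiped heartbeat table into ours, keeping the entry
    with the higher heartbeat counter per member."""
    for member, (count, ts) in remote.items():
        if member not in local or count > local[member][0]:
            local[member] = (count, ts)
    return local

def failed(table, now, t_fail):
    # Members whose counter has not advanced for t_fail are considered failed.
    return {m for m, (_, ts) in table.items() if now - ts > t_fail}
```

由于每个节点的表都是多个邻居视图的聚合,即使两台主机之间的直接链路故障,心跳信息仍然可以间接传播。

Because each node's table is an aggregate of several neighbors' views, heartbeat information can still propagate indirectly even when a direct link between two hosts fails.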

图 9-4显示了三个共享心跳计数器的通信进程:

Figure 9-4 shows three communicating processes sharing their heartbeat counters:

  • a) 所有三个都可以通信并更新它们的时间戳。

  • a) All three can communicate and update their timestamps.

  • b) P3 无法与 P1 通信,但其时间戳 t6 仍然可以通过 P2 传播。

  • b) P3 isn’t able to communicate with P1, but its timestamp t6 can still be propagated through P2.

  • c) P3 崩溃。由于它不再发送更新,因此它被其他进程检测为失败。

  • c) P3 crashes. Since it doesn’t send updates anymore, it is detected as failed by other processes.

图 9-4。用于故障检测的复制心跳表

这样,我们可以检测崩溃的节点,以及任何其他集群成员无法访问的节点。这个决定是可靠的,因为集群的视图是来自多个节点的聚合。如果两台主机之间出现链路故障,心跳仍然可以通过其他进程传播。使用八卦来传播系统状态会增加系统中的消息数量,但允许信息更可靠地传播。

This way, we can detect crashed nodes, as well as the nodes that are unreachable by any other cluster member. This decision is reliable, since the view of the cluster is an aggregate from multiple nodes. If there’s a link failure between the two hosts, heartbeats can still propagate through other processes. Using gossip for propagating system states increases the number of messages in the system, but allows information to spread more reliably.

逆转故障检测问题陈述

Reversing Failure Detection Problem Statement

由于传播有关故障的信息并不总是可能的,而且通过通知每个成员来传播可能代价高昂,一种称为 FUSE(故障通知服务)[DUNAGAN04] 的方法专注于可靠且廉价的故障传播,即使在网络分区的情况下也能工作。

Since propagating the information about failures is not always possible, and propagating it by notifying every member might be expensive, one of the approaches, called FUSE (failure notification service) [DUNAGAN04], focuses on reliable and cheap failure propagation that works even in cases of network partitions.

为了检测进程故障,此方法将所有活动进程分组。如果其中一组不可用,所有参与者都会检测到故障。换句话说,每次检测到单个进程故障时,都会对其进行转换和传播作为一个群体的失败。这允许在存在任何模式的断开连接、分区和节点故障时检测故障。

To detect process failures, this approach arranges all active processes in groups. If one of the groups becomes unavailable, all participants detect the failure. In other words, every time a single process failure is detected, it is converted and propagated as a group failure. This allows detecting failures in the presence of any pattern of disconnects, partitions, and node failures.

组中的进程定期向其他成员发送 ping 消息,查询它们是否还活着。如果其中一个成员由于崩溃、网络分区或链路故障而无法响应此消息,则发起此 ping 的成员将依次停止响应 ping 消息本身。

Processes in the group periodically send ping messages to other members, querying whether they’re still alive. If one of the members cannot respond to this message because of a crash, network partition, or link failure, the member that has initiated this ping will, in turn, stop responding to ping messages itself.
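
这种“以沉默传播故障”的行为可以用一个最小的成员对象来勾勒(草图;方法名是为演示而假设的):

This "propagate failure through silence" behavior can be sketched with a minimal member object (a sketch; the method names are assumed for demonstration):

```python
class FuseMember:
    """FUSE sketch: a member that observes a failure stops answering
    pings itself, so silence spreads through the whole group."""

    def __init__(self, name):
        self.name = name
        self.group_failed = False

    def on_ping(self):
        # Once a failure is observed, refuse to answer: the *absence*
        # of a response is the propagation mechanism.
        return None if self.group_failed else "ack"

    def on_missing_ack(self):
        # A peer failed to answer our ping: convert it to a group failure.
        self.group_failed = True
```

每个成员对缺失确认的反应都是自己变为沉默,因此单个进程的故障逐步转化为整个组的故障。

Each member reacts to a missing acknowledgment by going silent itself, so an individual process failure gradually converts into a failure of the entire group.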

图9-5显示了四个通信流程:

Figure 9-5 shows four communicating processes:

  • a) 初始状态:所有进程都处于活动状态并且可以通信。

  • a) Initial state: all processes are alive and can communicate.

  • b) P2 崩溃并停止响应 ping 消息。

  • b) P2 crashes and stops responding to ping messages.

  • c) P4 检测到 P2 的故障,并自己也停止响应 ping 消息。

  • c) P4 detects the failure of P2 and stops responding to ping messages itself.

  • d) 最终,P1 和 P3 注意到 P2 和 P4 都没有响应,进程失败传播到整个组。

  • d) Eventually, P1 and P3 notice that P2 and P4 do not respond, and the process failure propagates to the entire group.

图 9-5。保险丝故障检测

所有故障都会通过系统从故障源传播到所有其他参与者。参与者逐渐停止响应 ping,从单个节点故障转变为组故障。

All failures are propagated through the system from the source of failure to all other participants. Participants gradually stop responding to pings, converting from the individual node failure to the group failure.

在这里,我们使用缺乏通信作为传播手段。使用这种方法的一个优点是保证每个成员都能了解团队失败并对其做出充分反应。缺点之一是,将单个进程与其他进程分开的链路故障也可以转换为组故障,但这可以被视为优点,具体取决于用例。应用程序可以使用自己的传播故障定义来解决这种情况。

Here, we use the absence of communication as a means of propagation. An advantage of using this approach is that every member is guaranteed to learn about group failure and adequately react to it. One of the downsides is that a link failure separating a single process from other ones can be converted to the group failure as well, but this can be seen as an advantage, depending on the use case. Applications can use their own definitions of propagated failures to account for this scenario.

概括

Summary

故障检测器是任何分布式系统的重要组成部分。正如 FLP 不可能性结果所示,没有协议可以保证异步系统中的共识。故障检测器有助于增强模型,使我们能够通过在准确性和完整性之间进行权衡来解决共识问题。[CHANDRA96] 中描述了该领域的一项重要发现,证明了故障检测器的有用性:即使故障检测器犯了无限多次错误,也可以解决共识问题。

Failure detectors are an essential part of any distributed system. As shown by the FLP Impossibility result, no protocol can guarantee consensus in an asynchronous system. Failure detectors help to augment the model, allowing us to solve a consensus problem by making a trade-off between accuracy and completeness. One of the significant findings in this area, proving the usefulness of failure detectors, was described in [CHANDRA96], which shows that solving consensus is possible even with a failure detector that makes an infinite number of mistakes.

我们已经介绍了几种故障检测算法,每种算法都使用不同的方法:有些侧重于通过直接通信来检测故障,有些使用广播或八卦来传播信息,有些则使用静默(换句话说,没有通信)作为传播手段。我们现在知道我们可以使用心跳或 ping、硬性截止时间或连续尺度。这些方法各有自身的优点:简单性、准确性或精确性。

We’ve covered several algorithms for failure detection, each using a different approach: some focus on detecting failures by direct communication, some use broadcast or gossip for spreading the information around, and some opt out by using quiescence (in other words, absence of communication) as a means of propagation. We now know that we can use heartbeats or pings, hard deadlines, or continuous scales. Each one of these approaches has its own upsides: simplicity, accuracy, or precision.

第 10 章领导者选举

Chapter 10. Leader Election

同步成本可能相当高:如果每个算法步骤都涉及联系每个其他参与者,我们最终可能会产生巨大的通信开销。在大型且地理分布的网络中尤其如此。为了减少同步开销和达成决策所需的消息往返次数,一些算法依赖于 领导者(有时称为协调者)进程,负责执行或协调分布式算法的步骤。

Synchronization can be quite costly: if each algorithm step involves contacting each other participant, we can end up with a significant communication overhead. This is particularly true in large and geographically distributed networks. To reduce synchronization overhead and the number of message round-trips required to reach a decision, some algorithms rely on the existence of the leader (sometimes called coordinator) process, responsible for executing or coordinating steps of a distributed algorithm.

一般来说,分布式系统中的进程是统一的,任何进程都可以接管领导角色。进程会在很长一段时间内担任领导者,但这不是一个永久的角色。通常,进程在崩溃之前一直是领导者。崩溃后,任何其他进程都可以开始新一轮选举,如果当选则承担领导责任,并继续失败领导者的工作。

Generally, processes in distributed systems are uniform, and any process can take over the leadership role. Processes assume leadership for long periods of time, but this is not a permanent role. Usually, the process remains a leader until it crashes. After the crash, any other process can start a new election round, assume leadership, if it gets elected, and continue the failed leader’s work.

选举算法的活跃性保证了大多数时候都会有一个领导者,并且选举最终会完成(即系统不应该无限期地处于选举状态)。

The liveness of the election algorithm guarantees that most of the time there will be a leader, and the election will eventually complete (i.e., the system should not be in the election state indefinitely).

理想情况下,我们也希望保证安全性,即一次最多只有一个领导者,并完全消除脑裂情况的可能性(即两个服务于相同目的的领导者被选举出来,但彼此不知道对方)。然而,在实践中,许多领导者选举算法违反了这一约定。

Ideally, we’d like to assume safety, too, and guarantee there may be at most one leader at a time, and completely eliminate the possibility of a split brain situation (when two leaders serving the same purpose are elected but unaware of each other). However, in practice, many leader election algorithms violate this agreement.

例如,领导者进程可用于实现广播中消息的全序。领导者收集并保存全局状态,接收消息,并在进程之间传播它们。它还可用于在故障后、初始化期间或发生重要状态更改时协调系统重组。

Leader processes can be used, for example, to achieve a total order of messages in a broadcast. The leader collects and holds the global state, receives messages, and disseminates them among the processes. It can also be used to coordinate system reorganization after the failure, during initialization, or when important state changes happen.

当系统初始化,第一次选举出Leader,或者前一个Leader崩溃或无法通信时,会触发选举。选举必须是确定性的:选举过程中必须选出一位领导者。该决定需要对所有参与者都有效。

Election is triggered when the system initializes, and the leader is elected for the first time, or when the previous leader crashes or fails to communicate. Election has to be deterministic: exactly one leader has to emerge from the process. This decision needs to be effective for all participants.

尽管从理论角度来看,领导者选举和分布式锁定(即共享资源的独占所有权)可能看起来很相似,但它们还是略有不同。如果一个进程持有用于执行关键部分的锁,那么其他进程知道现在到底谁持有锁并不重要,只要满足活性属性(即锁最终将被释放,允许其他进程获取它)。相比之下,当选的进程具有一些特殊属性,并且必须为所有其他参与者所知,因此新当选的领导者必须向其同伴通知其角色。

Even though leader election and distributed locking (i.e., exclusive ownership over a shared resource) might look alike from a theoretical perspective, they are slightly different. If one process holds a lock for executing a critical section, it is unimportant for other processes to know who exactly is holding a lock right now, as long as the liveness property is satisfied (i.e., the lock will be eventually released, allowing others to acquire it). In contrast, the elected process has some special properties and has to be known to all other participants, so the newly elected leader has to notify its peers about its role.

如果分布式锁定算法对某个进程或进程组有任何类型的偏好,它最终将使非偏好进程无法获得共享资源,这与活性属性相矛盾。相比之下,领导者可以一直担任自己的角色,直到停止或崩溃,并且长寿的领导者是首选。

If a distributed locking algorithm has any sort of preference toward some process or group of processes, it will eventually starve nonpreferred processes from the shared resource, which contradicts the liveness property. In contrast, the leader can remain in its role until it stops or crashes, and long-lived leaders are preferred.

系统中拥有稳定的领导者有助于避免远程参与者之间的状态同步,减少交换消息的数量,并从单个进程驱动执行,而不需要点对点协调。具有领导概念的系统中的潜在问题之一是领导者可能成为瓶颈。为了克服这个问题,许多系统将数据划分为不相交的独立副本集(请参阅“数据库分区”)。每个副本集都有自己的领导者,而不是单个系统范围的领导者。使用这种方法的系统之一是 Spanner(请参阅“使用 Spanner 进行分布式事务”)。

Having a stable leader in the system helps to avoid state synchronization between remote participants, reduce the number of exchanged messages, and drive execution from a single process instead of requiring peer-to-peer coordination. One of the potential problems in systems with a notion of leadership is that the leader can become a bottleneck. To overcome that, many systems partition data in non-intersecting independent replica sets (see “Database Partitioning”). Instead of having a single system-wide leader, each replica set has its own leader. One of the systems that uses this approach is Spanner (see “Distributed Transactions with Spanner”).

因为每个领导者进程最终都会失败,所以必须检测、报告失败并做出反应:系统必须选举另一位领导者来取代失败的领导者。

Because every leader process will eventually fail, failure has to be detected, reported, and reacted upon: a system has to elect another leader to replace the failed one.

一些算法,例如 ZAB(参见“Zookeeper Atomic Broadcast (ZAB)”)、Multi-Paxos(参见“Multi-Paxos”)或 Raft(参见“Raft”),使用临时领导者来减少执行任务所需的消息数量。参与者之间达成协议。然而,这些算法使用自己的特定于算法的方法来进行领导者选举、故障检测和解决竞争领导者进程之间的冲突。

Some algorithms, such as ZAB (see “Zookeeper Atomic Broadcast (ZAB)”), Multi-Paxos (see “Multi-Paxos”), or Raft (see “Raft”), use temporary leaders to reduce the number of messages required to reach an agreement between the participants. However, these algorithms use their own algorithm-specific means for leader election, failure detection, and resolving conflicts between the competing leader processes.

恶霸算法

Bully Algorithm

领导者选举算法的一种,称为欺凌算法,使用进程排名来识别新的领导者。每个进程都会分配一个唯一的等级。在选举过程中,排名最高的进程成为领导者[MOLINA82]

One of the leader election algorithms, known as the bully algorithm, uses process ranks to identify the new leader. Each process gets a unique rank assigned to it. During the election, the process with the highest rank becomes a leader [MOLINA82].

该算法以其简单性而闻名。该算法被命名为“恶霸”,因为排名最高的节点会“欺负”其他节点接受它。它也被称为君主式领导者选举:在前一位君主不复存在后,排名最高的兄弟节点成为新的君主。

This algorithm is known for its simplicity. The algorithm is named bully because the highest-ranked node “bullies” other nodes into accepting it. It is also known as monarchial leader election: the highest-ranked sibling becomes a monarch after the previous one ceases to exist.

如果其中一个进程注意到系统中没有领导者(它从未初始化)或前一个领导者已停止响应请求,则选举开始,并分三个步骤进行:1

Election starts if one of the processes notices that there’s no leader in the system (it was never initialized) or the previous leader has stopped responding to requests, and proceeds in three steps:1

  1. 该进程向具有更高标识符的进程发送选举消息。

  1. The process sends election messages to processes with higher identifiers.

  2. 该进程等待,允许更高级别的进程响应。如果没有更高级别的进程响应,则继续执行步骤 3。否则,该进程通知它收到响应的最高级别进程,并允许其继续执行步骤 3。

  2. The process waits, allowing higher-ranked processes to respond. If no higher-ranked process responds, it proceeds with step 3. Otherwise, the process notifies the highest-ranked process it has heard from, and allows it to proceed with step 3.

  3. 该进程假定不存在具有更高级别的活动进程,并通知所有较低级别的进程有关新领导者的信息。

  3. The process assumes that there are no active processes with a higher rank, and notifies all lower-ranked processes about the new leader.
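
上述选举步骤可以勾勒为一个简单的递归(草图;is_alive 回调代替了发送 Election 消息并等待 Alive 响应的过程,进程等级即其标识符):

The election steps above can be sketched as a simple recursion (a sketch; the is_alive callback stands in for sending an Election message and waiting for an Alive response, and a process's rank is its identifier):

```python
def bully_election(self_id, peers, is_alive):
    """One bully election round from the point of view of `self_id`.
    Returns the identifier of the elected leader."""
    higher = [p for p in peers if p > self_id]
    alive_higher = [p for p in higher if is_alive(p)]
    if not alive_higher:
        # Nobody outranks us: announce ourselves as the new leader.
        return self_id
    # Hand the election over to the highest-ranked responder, which
    # then repeats the same steps itself.
    return bully_election(max(alive_higher), peers, is_alive)
```

在图 10-1 的场景中(领导者 6 崩溃,进程 3 发起选举),该草图会选出进程 5。

In the scenario of Figure 10-1 (leader 6 has crashed and process 3 starts the election), this sketch elects process 5.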

图 10-1说明了恶霸领导者选举算法:

Figure 10-1 illustrates the bully leader election algorithm:

  • a) 进程 3 注意到前一个领导者 6 已经崩溃,并通过向具有更高标识符的进程发送 Election 消息来开始新的选举。

  • a) Process 3 notices that the previous leader 6 has crashed and starts a new election by sending Election messages to processes with higher identifiers.

  • b) 4 和 5 回应 Alive,因为它们的排名高于 3。

  • b) 4 and 5 respond with Alive, as they have a higher rank than 3.

  • c) 3 通知在本轮中做出响应的最高级别进程 5。

  • c) 3 notifies the highest-ranked process 5 that has responded during this round.

  • d) 5 当选为新的领导者。它广播 Elected 消息,通知级别较低的进程选举结果。

  • d) 5 is elected as a new leader. It broadcasts Elected messages, notifying lower-ranked processes about the election results.

图 10-1。Bully算法:前一个领导者(6)失败,进程3开始新的选举

Figure 10-1. Bully algorithm: previous leader (6) fails and process 3 starts a new election

该算法的一个明显问题是,在存在网络分区的情况下,它违反了安全保证(一次最多只能选举一名领导者)。很容易出现这样的情况:节点被分成两个或多个独立运行的子集,并且每个子集都选举出自己的领导者。这种情况叫做脑裂(split brain)。

One of the apparent problems with this algorithm is that it violates the safety guarantee (that at most one leader can be elected at a time) in the presence of network partitions. It is quite easy to end up in the situation where nodes get split into two or more independently functioning subsets, and each subset elects its leader. This situation is called split brain.

该算法的另一个问题是对高排名节点的强烈偏好,如果它们不稳定,这就会成为一个问题,并可能导致永久的重新选举状态。一个不稳定的高排名节点提出自己作为领导者,不久之后失败,赢得连任,再次失败,整个过程重复。这个问题可以通过分配主机质量指标并在选举期间考虑它们来解决。

Another problem with this algorithm is a strong preference toward high-ranked nodes, which becomes an issue if they are unstable and can lead to a permanent state of reelection. An unstable high-ranked node proposes itself as a leader, fails shortly thereafter, wins reelection, fails again, and the whole process repeats. This problem can be solved by distributing host quality metrics and taking them into consideration during the election.

下一个在线故障转移

Next-In-Line Failover

那里有许多版本的恶霸算法可以改进其各种属性。例如,我们可以使用多个下一个在线替代流程作为故障转移来缩短重新选举[GHOLIPOUR09]

There are many versions of the bully algorithm that improve its various properties. For example, we can use multiple next-in-line alternative processes as a failover to shorten reelections [GHOLIPOUR09].

每个当选的领导者都会提供一个故障转移节点列表。当其中一个进程检测到领导者失败时,它会通过向失败领导者提供的列表中排名最高的替代者发送消息来开始新一轮选举。如果提议的替代方案之一成立,它将成为新的领导者,而无需经历完整的选举轮。

Each elected leader provides a list of failover nodes. When one of the processes detects a leader failure, it starts a new election round by sending a message to the highest-ranked alternative from the list provided by the failed leader. If one of the proposed alternatives is up, it becomes a new leader without having to go through the complete election round.

如果检测到领导者故障的进程本身就是列表中排名最高的进程,它可以立即通知进程有关新领导者的信息。

If the process that has detected the leader failure is itself the highest ranked process from the list, it can notify the processes about the new leader right away.
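This failover lookup can be sketched as follows, with liveness again modeled as a set (the function name is an illustrative assumption; a real system would probe the alternatives over the network and fall back to a full election round if none responds):

```python
# Sketch of next-in-line failover [GHOLIPOUR09]: try the failed leader's
# designated alternatives, highest rank first.

def next_in_line_failover(detector, failover_list, alive):
    for candidate in sorted(failover_list, reverse=True):
        if candidate == detector:
            # The detector itself is the best remaining alternative:
            # it can announce itself as the new leader right away.
            return detector
        if candidate in alive:
            return candidate
    return None  # no alternative responded; run a full election (not shown)

# Mirrors Figure 10-2: leader 6 with alternatives {5, 4} crashes,
# and process 3 contacts 5, the highest-ranked alternative.
print(next_in_line_failover(3, [5, 4], alive={1, 2, 3, 4, 5}))  # 5
```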

图 10-2显示了进行此优化后的流程:

Figure 10-2 shows the process with this optimization in place:

  • a) 指定了替代者列表{5,4}的领导者6崩溃了。3注意到此故障并联系5,即列表中排名最高的替代者。

  • a) 6, a leader with designated alternatives {5,4}, crashes. 3 notices this failure and contacts 5, the alternative from the list with the highest rank.

  • b) 5向3响应自己处于活动状态,以阻止3继续联系替代者列表中的其他节点。

  • b) 5 responds to 3 that it’s alive to prevent it from contacting other nodes from the alternatives list.

  • c)5通知其他节点它是新的领导者。

  • c) 5 notifies other nodes that it’s a new leader.

图 10-2。带故障转移的Bully算法:前一个领导者(6)失败,进程3通过联系排名最高的替代者开始新的选举

Figure 10-2. Bully algorithm with failover: previous leader (6) fails, and process 3 starts a new election by contacting the highest-ranked alternative

因此,如果下一顺位(next-in-line)进程处于活动状态,选举所需的步骤就会更少。

As a result, we require fewer steps during the election if the next-in-line process is alive.

候选/普通优化

Candidate/Ordinary Optimization

另一种算法试图通过将节点分为候选(candidate)和普通(ordinary)两个子集来降低对消息数量的要求,最终只有候选节点之一可以成为领导者[MURSHED12]。

Another algorithm attempts to lower requirements on the number of messages by splitting the nodes into two subsets, candidate and ordinary, where only one of the candidate nodes can eventually become a leader [MURSHED12].

普通进程通过以下方式发起选举:联系候选节点,收集它们的响应,选择排名最高的存活候选节点作为新领导者,然后将选举结果通知其余节点。

The ordinary process initiates election by contacting candidate nodes, collecting responses from them, picking the highest-ranked alive candidate as a new leader, and then notifying the rest of the nodes about the election results.

为了解决多个同时选举的问题,该算法建议使用决胜变量δ:一种特定于进程的延迟,在各节点之间差异很大,使得其中一个节点能够先于其他节点发起选举。决胜时间通常大于消息往返时间。优先级较高的节点具有较低的δ,反之亦然。

To solve the problem with multiple simultaneous elections, the algorithm proposes to use a tiebreaker variable δ, a process-specific delay, varying significantly between the nodes, that allows one of the nodes to initiate the election before the other ones. The tiebreaker time is generally greater than the message round-trip time. Nodes with higher priorities have a lower δ, and vice versa.
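A minimal sketch of this scheme follows; the candidate set used in the example and the `delta()` helper are illustrative assumptions, not values given in the text:

```python
# Sketch of the candidate/ordinary election [MURSHED12].

def delta(priority, rtt=0.01, step=0.05):
    """Hypothetical process-specific tiebreaker delay (seconds):
    higher priority => lower delay; always exceeds the round trip `rtt`."""
    return rtt + step / priority

def elect_from_candidates(candidates, alive):
    # An ordinary process contacts all candidates and picks the
    # highest-ranked alive candidate as the new leader.
    responders = [c for c in candidates if c in alive]
    return max(responders) if responders else None

# Mirrors Figure 10-3, assuming a candidate set of {1, 2, 6}:
# leader 6 crashed, and 2 is the highest-ranked alive candidate.
print(elect_from_candidates({1, 2, 6}, alive={1, 2, 3, 4, 5}))  # 2
```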

图10-3展示了选举过程的步骤:

Figure 10-3 shows the steps of the election process:

  • a) 普通集中的进程4注意到领导进程6失败。它通过联系候选集中的所有其余进程来开始新一轮选举。

  • a) Process 4 from the ordinary set notices the failure of leader process 6. It starts a new election round by contacting all remaining processes from the candidate set.

  • b) 候选进程响应以通知4它们仍然存活。

  • b) Candidate processes respond to notify 4 that they’re still alive.

  • c) 4通知所有进程新的领导者是2。

  • c) 4 notifies all processes about the new leader: 2.

图 10-3。候选/普通修改版Bully算法:前一个领导者(6)失败,进程4开始新的选举

Figure 10-3. Candidate/ordinary modified bully algorithm: previous leader (6) fails and process 4 starts a new election

邀请算法

Invitation Algorithm

邀请算法允许进程“邀请”其他进程加入自己的组,而不是试图在排名上压过它们。根据定义,该算法允许存在多个领导者,因为每个组都有自己的领导者。

An invitation algorithm allows processes to “invite” other processes to join their groups instead of trying to outrank them. This algorithm allows multiple leaders by definition, since each group has its own leader.

每个进程作为一个新组的领导者开始,其中唯一的成员是进程本身。小组领导联系不属于其小组的同伴,邀请他们加入。如果对等进程本身就是领导者,则两个组将被合并。否则,被联系的进程会以组长 ID 进行响应,从而允许两个组长以更少的步骤建立联系并合并组。

Each process starts as a leader of a new group, where the only member is the process itself. Group leaders contact peers that do not belong to their groups, inviting them to join. If the peer process is a leader itself, two groups are merged. Otherwise, the contacted process responds with a group leader ID, allowing two group leaders to establish contact and merge groups in fewer steps.
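The merge step can be sketched as follows; representing groups as a leader-to-members mapping is an illustrative assumption. Keeping the larger group's leader (as the next paragraph suggests) means only the smaller group's members must be re-notified:

```python
# Sketch of group merging in the invitation algorithm.
# `groups` maps each current leader to the set of its members.

def merge_groups(groups, leader_a, leader_b):
    keep, absorb = ((leader_a, leader_b)
                    if len(groups[leader_a]) >= len(groups[leader_b])
                    else (leader_b, leader_a))
    groups[keep] |= groups.pop(absorb)  # absorb the smaller group
    return keep  # leader of the merged group

# Mirrors Figure 10-4: groups {1,2} and {3,4} merge, and 1 leads the result.
groups = {1: {1, 2}, 3: {3, 4}}
print(merge_groups(groups, 1, 3))  # 1
```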

图10-4展示了邀请算法的执行步骤:

Figure 10-4 shows the execution steps of the invitation algorithm:

  • a) 四个进程开始时各自是只包含一名成员的组的领导者。1邀请2加入自己的组,3邀请4加入自己的组。

  • a) Four processes start as leaders of groups containing one member each. 1 invites 2 to join its group, and 3 invites 4 to join its group.

  • b) 2加入进程1的组,4加入进程3的组。第一组的领导者1联系另一组的领导者3。其余组成员(本例中为4)会收到有关新组长的通知。

  • b) 2 joins a group with process 1, and 4 joins a group with process 3. 1, the leader of the first group, contacts 3, the leader of the other group. Remaining group members (4, in this case) are notified about the new group leader.

  • c) 两个组被合并,1成为扩展后的组的领导者。

  • c) Two groups are merged and 1 becomes a leader of an extended group.

图 10-4。邀请算法

Figure 10-4. Invitation algorithm

由于组已合并,因此建议组合并的进程成为新领导者还是另一个进程成为新领导者并不重要。为了将合并组所需的消息数量保持在最低限度,较大组的领导者可以成为新组的领导者。这样,只有较小组的进程才需要收到有关领导者变更的通知。

Since groups are merged, it doesn’t matter whether the process that suggested the group merge becomes a new leader or the other one does. To keep the number of messages required to merge groups to a minimum, a leader of a larger group can become a leader for a new group. This way only the processes from the smaller group have to be notified about the change of leader.

与前面讨论的其他算法类似,该算法允许进程分属多个组,从而存在多个领导者。邀请算法允许创建进程组并将其合并,而无需从头触发新的选举,从而减少了完成选举所需的消息数量。

Similar to the other discussed algorithms, this algorithm allows processes to settle in multiple groups and have multiple leaders. The invitation algorithm allows creating process groups and merging them without having to trigger a new election from scratch, reducing the number of messages required to finish the election.

环形算法

Ring Algorithm

在环算法[CHANG79]中,系统中的所有节点形成一个环,并且知道环的拓扑(即它们在环中的前驱和后继)。当进程检测到领导者失败时,就会开始新的选举。选举消息在环上转发:每个进程都联系其后继节点(环中距离它最近的下一个节点)。如果该节点不可用,则进程会跳过该不可达节点,并尝试联系环中它之后的节点,直到最终其中一个节点做出响应。

In the ring algorithm [CHANG79], all nodes in the system form a ring and are aware of the ring topology (i.e., their predecessors and successors in the ring). When the process detects the leader failure, it starts the new election. The election message is forwarded across the ring: each process contacts its successor (the next node closest to it in the ring). If this node is unavailable, the process skips the unreachable node and attempts to contact the nodes after it in the ring, until eventually one of them responds.

节点沿着环依次联系各自的后继节点并收集活动节点集合,在将集合传递给下一个节点之前先把自己加入其中。这类似于“无超时故障检测器”中描述的故障检测算法:节点在把路径传递给下一个节点之前,会将自己的标识符追加到路径中。

Nodes contact their siblings, following around the ring and collecting the live node set, adding themselves to the set before passing it over to the next node, similar to the failure-detection algorithm described in “Timeout-Free Failure Detector”, where nodes append their identifiers to the path before passing it to the next node.

该算法通过完全遍历环来进行。当消息返回到开始选举的节点时,活动集中排名最高的节点被选为领导者。在图10-5中,你可以看到这样一个遍历的例子:

The algorithm proceeds by fully traversing the ring. When the message comes back to the node that started the election, the highest-ranked node from the live set is chosen as a leader. In Figure 10-5, you can see an example of such a traversal:

  • a) 前一个领导者6已经失败,并且每个进程都从其角度查看环。

  • a) Previous leader 6 has failed and each process has a view of the ring from its perspective.

  • b) 3通过开始遍历来发起一轮选举。在每一步中,消息都携带到目前为止在路径上已遍历的节点集合。5无法到达6,因此跳过它,直接联系1。

  • b) 3 initiates an election round by starting traversal. On each step, there’s a set of nodes traversed on the path so far. 5 can’t reach 6, so it skips it and goes straight to 1.

  • c) 由于5是排名最高的节点,3发起另一轮消息,分发新领导者的信息。

  • c) Since 5 was the node with the highest rank, 3 initiates another round of messages, distributing the information about the new leader.

图 10-5。环形算法:前任领导者(6)失败,3开始选举过程

Figure 10-5. Ring algorithm: previous leader (6) fails and 3 starts the election process

该算法的变体包括只收集单个排名最高的标识符而不是一组活动节点,以节省空间:由于max函数是可交换的,只要知道当前最大值就足够了。当算法回到发起选举的节点时,最后已知的最高标识符会再次在环上循环一圈。

Variants of this algorithm include collecting a single highest-ranked identifier instead of a set of active nodes to save space: since the max function is commutative, it is enough to know a current maximum. When the algorithm comes back to the node that has started the election, the last known highest identifier is circulated across the ring once again.
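The max-circulating variant can be sketched as follows; the function name and the list-based ring are illustrative assumptions, and the second traversal that distributes the result is only noted in a comment:

```python
# Sketch of the ring election variant that circulates a single maximum
# identifier instead of the full live set [CHANG79].

def ring_election(ring, alive, start):
    current_max = start
    idx = ring.index(start)
    n = len(ring)
    for step in range(1, n):
        node = ring[(idx + step) % n]
        if node in alive:  # unreachable nodes are simply skipped
            current_max = max(current_max, node)
    # A second traversal (not shown) would circulate `current_max`
    # around the ring to announce the new leader.
    return current_max

# Mirrors Figure 10-5: leader 6 has failed and 3 starts the traversal.
print(ring_election([1, 2, 3, 4, 5, 6], alive={1, 2, 3, 4, 5}, start=3))  # 5
```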

由于环可以分为两个或多个部分,每个部分都可能选举自己的领导者,因此这种方法也不具备安全性。

Since the ring can be partitioned in two or more parts, with each part potentially electing its own leader, this approach doesn’t hold a safety property, either.

正如您所看到的,为了使具有领导者的系统正常运行,我们需要知道当前领导者的状态(它是否还活着),因为为了保持进程有序并继续执行,领导者必须存活且可达,才能履行其职责。为了检测领导者崩溃,我们可以使用故障检测算法(参见第9章)。

As you can see, for a system with a leader to function correctly, we need to know the status of the current leader (whether it is alive or not), since to keep processes organized and for execution to continue, the leader has to be alive and reachable to perform its duties. To detect leader crashes, we can use failure-detection algorithms (see Chapter 9).

小结

Summary

领导者选举是分布式系统中的一个重要主题,因为使用指定的领导者有助于减少协调开销并提高算法的性能。选举轮次可能成本高昂,但由于其频率较低,因此不会对整体系统性能产生负面影响。单个领导者可能会成为瓶颈,但大多数时候,这是通过对数据进行分区并使用每个分区的领导者或使用不同的领导者执行不同的操作来解决的。

Leader election is an important subject in distributed systems, since using a designated leader helps to reduce coordination overhead and improve the algorithm’s performance. Election rounds might be costly but, since they’re infrequent, they do not have a negative impact on the overall system performance. A single leader can become a bottleneck, but most of the time this is solved by partitioning data and using per-partition leaders or using different leaders for different actions.

不幸的是,本章讨论的所有算法都容易出现脑裂问题:我们最终可能会在相互隔离的子网中出现两个互不知晓对方存在的领导者。为了避免脑裂,我们必须获得集群范围内的多数票。

Unfortunately, all the algorithms we’ve discussed in this chapter are prone to the split brain problem: we can end up with two leaders in independent subnets that are not aware of each other’s existence. To avoid split brain, we have to obtain a cluster-wide majority of votes.

许多共识算法,包括 Multi-Paxos 和 Raft,都依赖领导者进行协调。但领导者选举不就等于共识吗?为了选举一个领导者,我们需要就它的身份达成共识。如果我们能够就领导者的身份达成共识,我们就可以用同样的方式就其他任何事情达成共识[ABRAHAM13]。

Many consensus algorithms, including Multi-Paxos and Raft, rely on a leader for coordination. But isn’t leader election the same as consensus? To elect a leader, we need to reach a consensus about its identity. If we can reach consensus about the leader identity, we can use the same means to reach consensus on anything else [ABRAHAM13].

领导者的身份可能会在进程不知道的情况下发生变化,因此问题是关于领导者的进程局部知识是否仍然有效。为了实现这一目标,我们需要将领导者选举与故障检测结合起来。例如, 稳定领导者选举算法使用具有唯一稳定领导者的轮次和基于超时的故障检测来保证领导者可以在不崩溃且可访问的情况下保持其位置[AGUILERA01]

The identity of a leader may change without processes knowing about it, so the question is whether the process-local knowledge about the leader is still valid. To achieve that, we need to combine leader election with failure detection. For example, the stable leader election algorithm uses rounds with a unique stable leader and timeout-based failure detection to guarantee that the leader can retain its position for as long as it doesn’t crash and is accessible [AGUILERA01].

依赖领导者选举的算法通常允许多个领导者同时存在,并试图尽快解决领导者之间的冲突。例如,Multi-Paxos(参见“Multi-Paxos”)就是如此:两个相互冲突的领导者(提议者)中只有一个可以继续推进,这些冲突通过收集第二个法定人数来解决,以保证来自两个不同提议者的值不会都被接受。

Algorithms that rely on leader election often allow the existence of multiple leaders and attempt to resolve conflicts between the leaders as quickly as possible. For example, this is true for Multi-Paxos (see “Multi-Paxos”), where only one of the two conflicting leaders (proposers) can proceed, and these conflicts are resolved by collecting a second quorum, guaranteeing that the values from two different proposers won’t be accepted.

在 Raft 中(参见“Raft”),领导者可以发现其任期已过时,这意味着系统中存在不同的领导者,并将其任期更新为较新的任期。

In Raft (see “Raft”), a leader can discover that its term is out-of-date, which implies the presence of a different leader in the system, and update its term to the more recent one.

在这两种情况下,拥有领导者是确保活力的一种方式(如果当前领导者失败了,我们需要一个新的领导者),并且流程不应该无限期地花费很长时间来了解它是否真的失败了。缺乏安全性并允许多个领导者是一种性能优化:算法可以继续进行复制阶段,并通过检测和解决冲突来保证安全性。

In both cases, having a leader is a way to ensure liveness (if the current leader has failed, we need a new one), and processes should not take indefinitely long to understand whether or not it has really failed. Lack of safety and allowing multiple leaders is a performance optimization: algorithms can proceed with a replication phase, and safety is guaranteed by detecting and resolving the conflicts.

我们将在第 14 章中更详细地讨论共识背景下的共识和领导者选举。

We discuss consensus and leader election in the context of consensus in more detail in Chapter 14.

1 这些步骤描述的是修改后的Bully选举算法[KORDAFSHARI05],因为它更紧凑、更清晰。

1 These steps describe the modified bully election algorithm [KORDAFSHARI05] as it’s more compact and clear.

第 11 章复制和一致性

Chapter 11. Replication and Consistency

在继续讨论共识和原子提交算法之前,让我们先补齐深入理解它们所需的最后一块拼图:一致性模型。一致性模型很重要,因为它们解释了存在多个数据副本时系统的可见性语义和行为。

Before we move on to discuss consensus and atomic commitment algorithms, let’s put together the last piece required for their in-depth understanding: consistency models. Consistency models are important, since they explain visibility semantics and behavior of the system in the presence of multiple copies of data.

容错能力是系统的一种属性,指系统在其组件出现故障时仍能继续正确运行。使系统具备容错能力并非易事,向现有系统添加容错能力也可能很困难。首要目标是消除系统中的单点故障,并确保关键任务组件具有冗余。通常,冗余对用户是完全透明的。

Fault tolerance is a property of a system that can continue operating correctly in the presence of failures of its components. Making a system fault-tolerant is not an easy task, and it may be difficult to add fault tolerance to the existing system. The primary goal is to remove a single point of failure from the system and make sure that we have redundancy in mission-critical components. Usually, redundancy is entirely transparent for the user.

系统可以通过存储多个数据副本来继续正确运行,以便当其中一台机器发生故障时,另一台机器可以充当故障转移。在具有单一事实来源的系统中(例如,主数据库/副本数据库),可以通过将副本提升为新的主数据库来显式完成故障转移。其他系统不需要显式重新配置,并通过在读写查询期间收集多个参与者的响应来确保一致性。

A system can continue operating correctly by storing multiple copies of data so that, when one of the machines fails, the other one can serve as a failover. In systems with a single source of truth (for example, primary/replica databases), failover can be done explicitly, by promoting a replica to become a new master. Other systems do not require explicit reconfiguration and ensure consistency by collecting responses from multiple participants during read and write queries.

数据复制是一种通过在系统中维护多个数据副本来引入冗余的方法。然而,由于以原子方式更新多个数据副本是一个等同于共识[MILOSEVIC11]的问题,对数据库中的每个操作都这样做的成本可能相当高。我们可以探索一些更具成本效益、更灵活的方法,让数据从用户的角度看起来是一致的,同时允许参与者之间存在一定程度的分歧。

Data replication is a way of introducing redundancy by maintaining multiple copies of data in the system. However, since updating multiple copies of data atomically is a problem equivalent to consensus [MILOSEVIC11], it might be quite costly to perform this operation for every operation in the database. We can explore some more cost-effective and flexible ways to make data look consistent from the user’s perspective, while allowing some degree of divergence between participants.

复制在多数据中心部署中尤其重要。在这种情况下,异地复制有多种用途:它通过提供冗余来提高可用性以及承受一个或多个数据中心故障的能力。它还可以通过将数据副本放置在物理上更靠近客户端的位置来帮助减少延迟。

Replication is particularly important in multidatacenter deployments. Geo-replication, in this case, serves multiple purposes: it increases availability and the ability to withstand a failure of one or more datacenters by providing redundancy. It can also help to reduce the latency by placing a copy of data physically closer to the client.

当数据记录被修改时,其副本必须相应更新。在谈论复制时,我们最关心三个事件:写入副本更新读取。这些操作会触发客户端发起的一系列事件。在某些情况下,从客户端的角度来看,更新副本可能会在写入完成后发生,但这仍然不会改变客户端必须能够以特定顺序观察操作的事实。

When data records are modified, their copies have to be updated accordingly. When talking about replication, we care most about three events: write, replica update, and read. These operations trigger a sequence of events initiated by the client. In some cases, updating replicas can happen after the write has finished from the client perspective, but this still does not change the fact that the client has to be able to observe operations in a particular order.

实现可用性

Achieving Availability

我们已经讨论了分布式系统的谬误,并发现了许多可能出错的地方。在现实世界中,节点并不总是处于活动状态或能够相互通信。然而,间歇性故障不应影响可用性:从用户的角度来看,整个系统必须继续运行,就好像什么也没发生一样。

We’ve talked about the fallacies of distributed systems and have identified many things that can go wrong. In the real world, nodes aren’t always alive or able to communicate with one another. However, intermittent failures should not impact availability: from the user’s perspective, the system as a whole has to continue operating as if nothing has happened.

系统可用性是一个非常重要的属性:在软件工程中,我们始终努力实现高可用性,并尽量减少停机时间。工程团队吹嘘他们的正常运行时间指标。我们如此关心可用性有几个原因:软件已经成为我们社会不可或缺的一部分,没有它,许多重要的事情就无法发生:银行交易、通信、旅行等等。

System availability is an incredibly important property: in software engineering, we always strive for high availability, and try to minimize downtime. Engineering teams brag about their uptime metrics. We care so much about availability for several reasons: software has become an integral part of our society, and many important things cannot happen without it: bank transactions, communication, travel, and so on.

对于公司来说,缺乏可用性可能意味着失去客户或金钱:如果在线商店出现故障,您将无法在网上购物,如果银行网站没有响应,您将无法转账。

For companies, lack of availability can mean losing customers or money: you can’t shop in the online store if it’s down, or transfer the money if your bank’s website isn’t responding.

为了使系统具有高可用性,我们需要以一种允许优雅地处理一个或多个参与者的故障或不可用的方式进行设计。为此,我们需要引入冗余和复制。然而,一旦添加冗余,我们就面临保持多个数据副本同步的问题,并且必须实现恢复机制。

To make the system highly available, we need to design it in a way that allows handling failures or unavailability of one or more participants gracefully. For that, we need to introduce redundancy and replication. However, as soon as we add redundancy, we face the problem of keeping several copies of data in sync and have to implement recovery mechanisms.

臭名昭著的CAP

Infamous CAP

可用性是衡量系统能否成功响应每个请求的属性。可用性的理论定义只要求最终响应,但当然,在现实系统中,我们希望避免需要无限长时间才能响应的服务。

Availability is a property that measures the ability of the system to serve a response for every request successfully. The theoretical definition of availability mentions eventual response, but of course, in a real-world system, we’d like to avoid services that take indefinitely long to respond.

理想情况下,我们希望每个操作都是一致的。一致性在这里定义为原子性线性化一致性(参见“线性化”)。线性化历史可以表示为保留原始操作顺序的瞬时操作序列。线性化简化了对可能的系统状态的推理,并使分布式系统看起来就像在单台机器上运行一样。

Ideally, we’d like every operation to be consistent. Consistency is defined here as atomic or linearizable consistency (see “Linearizability”). Linearizable history can be expressed as a sequence of instantaneous operations that preserves the original operation order. Linearizability simplifies reasoning about the possible system states and makes a distributed system appear as if it was running on a single machine.

我们希望在容忍网络分区的同时实现一致性和可用性。网络可能会分成几个部分,其中进程无法相互通信:在分区节点之间发送的某些消息将无法到达目的地。

We would like to achieve both consistency and availability while tolerating network partitions. The network can get split into several parts where processes are not able to communicate with each other: some of the messages sent between partitioned nodes won’t reach their destinations.

可用性要求任何无故障的节点都能提供结果,而一致性则要求结果可线性化。由 Eric Brewer 提出的 CAP 猜想讨论了一致性、可用性和分区容错性之间的权衡[BREWER00]

Availability requires any nonfailing node to deliver results, while consistency requires results to be linearizable. CAP conjecture, formulated by Eric Brewer, discusses trade-offs between Consistency, Availability, and Partition tolerance [BREWER00].

在异步系统中,可用性需求无法得到满足;在存在网络分区的情况下,我们无法实现同时保证可用性和一致性的系统[GILBERT02]。我们可以构建在提供尽力而为的可用性的同时保证强一致性的系统,或者在提供尽力而为的一致性的同时保证可用性的系统[GILBERT12]。这里的尽力而为意味着:如果一切正常,系统不会故意违反任何保证;但在网络分区的情况下,允许保证被削弱和违反。

Availability requirement is impossible to satisfy in an asynchronous system, and we cannot implement a system that simultaneously guarantees both availability and consistency in the presence of network partitions [GILBERT02]. We can build systems that guarantee strong consistency while providing best effort availability, or guarantee availability while providing best effort consistency [GILBERT12]. Best effort here implies that if everything works, the system will not purposefully violate any guarantees, but guarantees are allowed to be weakened and violated in the case of network partitions.

换句话说,CAP描述了一个连续的选择谱系,在谱系的两端分别是以下系统:

In other words, CAP describes a continuum of potential choices, where on different sides of the spectrum we have systems that are:

一致性和分区容忍性
Consistent and partition tolerant

CP系统宁可让请求失败,也不提供可能不一致的数据。

CP systems prefer failing requests to serving potentially inconsistent data.

可用且分区容忍
Available and partition tolerant

AP 系统放宽了一致性要求,并允许在请求期间提供可能不一致的值。

AP systems loosen the consistency requirement and allow serving potentially inconsistent values during the request.

CP系统的一个例子是需要多数节点才能取得进展的共识算法实现:它始终保持一致,但在网络分区时可能不可用。而只要还有一个副本存活就始终接受写入并提供读取服务的数据库,则是AP系统的一个例子,它最终可能丢失数据或提供不一致的结果。

An example of a CP system is an implementation of a consensus algorithm, requiring a majority of nodes for progress: always consistent, but might be unavailable in the case of a network partition. A database always accepting writes and serving reads as long as even a single replica is up is an example of an AP system, which may end up losing data or serving inconsistent results.

PACELC猜想[ABADI12]是CAP的扩展,它指出:在存在网络分区的情况下,需要在一致性和可用性之间做出选择(PAC);否则(E),即使系统正常运行,我们仍然要在延迟和一致性之间做出选择。

PACELC conjecture [ABADI12], an extension of CAP, states that in the presence of network partitions there’s a choice between consistency and availability (PAC). Else (E), even if the system is running normally, we still have to make a choice between latency and consistency.

谨慎使用 CAP

Use CAP Carefully

需要注意的重要一点是,CAP讨论的是网络分区,而不是节点崩溃或任何其他类型的故障(例如崩溃恢复)。与集群其余部分隔离的节点可以返回不一致的响应,而崩溃的节点根本不会响应。一方面,这意味着即使没有任何节点宕机,也可能面临一致性问题。另一方面,现实世界中的情况并非仅此而已:存在许多不同的故障场景(其中一些可以用网络分区来模拟)。

It’s important to note that CAP discusses network partitions rather than node crashes or any other type of failure (such as crash-recovery). A node, partitioned from the rest of the cluster, can serve inconsistent requests, but a crashed node will not respond at all. On the one hand, this implies that it’s not necessary to have any nodes down to face consistency problems. On the other hand, this isn’t the case in the real world: there are many different failure scenarios (some of which can be simulated with network partitions).

CAP 意味着,即使所有节点都在运行,只要它们之间存在连接问题,我们也可能面临一致性问题,因为我们期望每个未故障的节点都能正确响应,而不管有多少节点可能宕机。

CAP implies that we can face consistency problems even if all the nodes are up, but there are connectivity issues between them since we expect every nonfailed node to respond correctly, with no regard to how many nodes may be down.

CAP 猜想有时用三角形来表示,就好像我们可以转动旋钮并或多或少地获得所有三个参数。然而,虽然我们可以转动旋钮并以一致性换取可用性,但分区容错性是我们无法实际调整或换取任何东西的属性[HALE10]

CAP conjecture is sometimes illustrated as a triangle, as if we could turn a knob and have more or less of all of the three parameters. However, while we can turn a knob and trade consistency for availability, partition tolerance is a property we cannot realistically tune or trade for anything [HALE10].

提示

CAP 中对一致性的定义与 ACID(参见第 5 章)中的一致性定义截然不同。ACID一致性描述的是事务一致性:事务将数据库从一种有效状态带到另一种有效状态,维护所有数据库不变量(例如唯一性约束和引用完整性)。在CAP中,一致性意味着操作是原子的(操作整体成功或整体失败)且一致的(操作永远不会使数据处于不一致状态)。

Consistency in CAP is defined quite differently from what ACID (see Chapter 5) defines as consistency. ACID consistency describes transaction consistency: transaction brings the database from one valid state to another, maintaining all the database invariants (such as uniqueness constraints and referential integrity). In CAP, it means that operations are atomic (operations succeed or fail in their entirety) and consistent (operations never leave the data in an inconsistent state).

CAP中的可用性也不同于前面所说的高可用性[KLEPPMANN15]。CAP的定义对执行延迟没有任何限制。此外,与CAP相反,数据库语境下的可用性并不要求每个未故障节点都响应每个请求。

Availability in CAP is also different from the aforementioned high availability [KLEPPMANN15]. The CAP definition puts no bounds on execution latency. Additionally, availability in databases, contrary to CAP, doesn’t require every nonfailed node to respond to every request.

CAP 猜想用于解释分布式系统、推理故障场景以及评估可能的情况,但重要的是要记住,放弃一致性和提供不可预测的结果之间存在微妙的界限。

CAP conjecture is used to explain distributed systems, reason about failure scenarios, and evaluate possible situations, but it’s important to remember that there’s a fine line between giving up consistency and serving unpredictable results.

声称属于可用性一侧的数据库,如果使用正确,在有足够多存活副本的前提下,仍然能够从副本提供一致的结果。当然,还有更复杂的故障场景,CAP猜想只是一个经验法则,并不一定能说明全部事实。1

Databases that claim to be on the availability side, when used correctly, are still able to serve consistent results from replicas, given there are enough replicas alive. Of course, there are more complicated failure scenarios and CAP conjecture is just a rule of thumb, and it doesn’t necessarily tell the whole truth.1

收获与产量

Harvest and Yield

CAP猜想仅讨论一致性和可用性的最强形式:线性化,以及系统最终响应每个请求的能力。这迫使我们在这两个属性之间做出艰难的权衡。然而,一些应用程序可以从稍微宽松的假设中受益,我们可以以较弱的形式来考虑这些属性。

CAP conjecture discusses consistency and availability only in their strongest forms: linearizability and the ability of the system to eventually respond to every request. This forces us to make a hard trade-off between the two properties. However, some applications can benefit from slightly relaxed assumptions and we can think about these properties in their weaker forms.

系统不必在一致与可用之间二选一,而是可以提供宽松的保证。我们可以定义两个可调的指标:收获(harvest)和产量(yield),在二者之间进行选择仍然构成正确的行为[FOX99]:

Instead of being either consistent or available, systems can provide relaxed guarantees. We can define two tunable metrics: harvest and yield, choosing between which still constitutes correct behavior [FOX99]:

收获
Harvest

定义查询的完整程度:如果查询必须返回 100 行,但由于某些节点不可用而只能获取 99 行,那么它仍然比查询完全失败并且不返回任何内容要好。

Defines how complete the query is: if the query has to return 100 rows, but can fetch only 99 due to unavailability of some nodes, it still can be better than failing the query completely and returning nothing.

产量
Yield

指成功完成的请求数与尝试的请求总数之比。产量与正常运行时间不同:例如,繁忙的节点并没有宕机,但仍可能无法响应某些请求。

Specifies the number of requests that were completed successfully, compared to the total number of attempted requests. Yield is different from the uptime, since, for example, a busy node is not down, but still can fail to respond to some of the requests.
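The two metrics reduce to simple ratios; the function names here are illustrative assumptions:

```python
# Harvest and yield as ratios [FOX99]: harvest is the fraction of the
# data a query returned, yield the fraction of requests that completed.

def harvest(rows_returned, rows_expected):
    return rows_returned / rows_expected

def yield_metric(completed_requests, attempted_requests):
    return completed_requests / attempted_requests

print(harvest(99, 100))         # 0.99: one partition was unavailable
print(yield_metric(995, 1000))  # 0.995
```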

这将权衡的焦点从绝对项转移到相对项。我们可以用收获换取产量,并允许某些请求返回不完整的数据。提高产量的方法之一是仅从可用分区返回查询结果(请参阅“数据库分区”)。例如,如果存储某些用户记录的节点子集发生故障,我们仍然可以继续为其他用户提供请求。或者,我们可以要求仅完整返回关键应用程序数据,但允许其他请求存在一些偏差。

This shifts the focus of the trade-off from the absolute to the relative terms. We can trade harvest for yield and allow some requests to return incomplete data. One of the ways to increase yield is to return query results only from the available partitions (see “Database Partitioning”). For example, if a subset of nodes storing records of some users is down, we can still continue serving requests for other users. Alternatively, we can require the critical application data to be returned only in its entirety, but allow some deviations for other requests.

定义、衡量并在收获和产量之间做出有意识的选择有助于我们构建更能抵御故障的系统。

Defining, measuring, and making a conscious choice between harvest and yield helps us to build systems that are more resilient to failures.

共享内存

Shared Memory

对客户端而言,存储数据的分布式系统表现得就像拥有共享存储一样,类似于单节点系统。节点间通信和消息传递被抽象掉,在幕后进行。这造成了一种共享内存的错觉。

For a client, the distributed system storing the data acts as if it has shared storage, similar to a single-node system. Internode communication and message passing are abstracted away and happen behind the scenes. This creates an illusion of a shared memory.

可通过读或写操作访问的单个存储单元通常称为一个寄存器。我们可以将分布式数据库中的共享内存视为此类寄存器的数组

A single unit of storage, accessible by read or write operations, is usually called a register. We can view shared memory in a distributed database as an array of such registers.

我们通过调用完成事件来识别每个操作。如果调用操作的进程在完成之前崩溃,我们将操作定义为失败。如果一个操作的调用和完成事件都发生在调用另一个操作之前,我们就说这个操作先于另一个操作,并且这两个操作顺序的。否则,我们说它们是并发的

We identify every operation by its invocation and completion events. We define an operation as failed if the process that invoked it crashes before it completes. If both invocation and completion events for one operation happen before the other operation is invoked, we say that this operation precedes the other one, and these two operations are sequential. Otherwise, we say that they are concurrent.
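The precedence and concurrency relations can be expressed directly in terms of invocation/completion timestamps; the tuple representation and function names are illustrative assumptions:

```python
# An operation is modeled as a (invocation, completion) timestamp pair.

def precedes(op_a, op_b):
    # op_a precedes op_b when both its events happen before op_b is invoked.
    return op_a[1] < op_b[0]

def concurrent(op_a, op_b):
    # Operations are concurrent when neither precedes the other.
    return not precedes(op_a, op_b) and not precedes(op_b, op_a)

p1 = (0, 10)   # invoked at t=0, completed at t=10
p2 = (12, 20)  # starts after p1 finished: sequential, like case (a)
p3 = (5, 8)    # starts after and ends before p1: concurrent, like case (c)
print(precedes(p1, p2))    # True
print(concurrent(p1, p3))  # True
```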

图11-1中,您可以看到流程和执行不同的操作:P1P2

In Figure 11-1, you can see processes P1 and P2 executing different operations:

  • a) 进程P2执行的操作在P1执行的操作完成之后才开始,这两个操作是顺序的。

  • a) The operation performed by process P2 starts after the operation executed by P1 has already finished, and the two operations are sequential.

  • b) 两个操作之间有重叠,因此这些操作是并发的

  • b) There’s an overlap between the two operations, so these operations are concurrent.

  • c) P2执行的操作在P1执行的操作之后开始,并在其完成之前结束。这些操作也是并发的。

  • c) The operation executed by P2 starts after and completes before the operation executed by P1. These operations are concurrent, too.

图 11-1。顺序和并发操作

Figure 11-1. Sequential and concurrent operations

多个读取器或写入器可以同时访问寄存器。对寄存器的读写操作不是瞬时完成的,需要一些时间。不同进程执行的并发读/写操作不是串行的:根据操作重叠时寄存器的行为方式,它们可能以不同的顺序排列,并可能产生不同的结果。根据寄存器在并发操作时的行为方式,我们区分三种类型的寄存器:

Multiple readers or writers can access the register simultaneously. Read and write operations on registers are not immediate and take some time. Concurrent read/write operations performed by different processes are not serial: depending on how registers behave when operations overlap, they might be ordered differently and may produce different results. Depending on how the register behaves in the presence of concurrent operations, we distinguish among three types of registers:

安全的
Safe

在并发写操作期间,对安全寄存器的读取可以返回该寄存器取值范围内的任意值(这听起来不太实用,但可能描述了不强加顺序的异步系统的语义)。在与写入并发的读取期间,具有二进制值的安全寄存器可能看起来在“闪烁”(即返回在两个值之间交替的结果)。

Reads to the safe registers may return arbitrary values within the range of the register during a concurrent write operation (which does not sound very practical, but might describe the semantics of an asynchronous system that does not impose the order). Safe registers with binary values might appear to be flickering (i.e., returning results alternating between the two values) during reads concurrent to writes.

常规的
Regular

对于常规寄存器,我们有稍强一些的保证:读操作只能返回最近完成的写入所写入的值,或与当前读取重叠的写操作所写入的值。在这种情况下,系统有一定的顺序概念,但写入结果并不会同时对所有读取器可见(例如,这可能发生在复制数据库中,主节点接受写入并将其复制到提供读取服务的工作节点)。

For regular registers, we have slightly stronger guarantees: a read operation can return only the value written by the most recent completed write or the value written by the write operation that overlaps with the current read. In this case, the system has some notion of order, but write results are not visible to all the readers simultaneously (for example, this may happen in a replicated database, where the master accepts writes and replicates them to workers serving reads).

原子
Atomic

原子寄存器保证线性化:每个写操作都有一个时刻,在此之前每个读操作都会返回一个旧值,之后每个读操作都会返回一个新值。原子性是简化系统状态推理的基本属性。

Atomic registers guarantee linearizability: every write operation has a single moment before which every read operation returns an old value and after which every read operation returns a new one. Atomicity is a fundamental property that simplifies reasoning about the system state.

Ordering

When we see a sequence of events, we have some intuition about their execution order. However, in a distributed system it’s not always that easy, because it’s hard to know when exactly something has happened and have this information available instantly across the cluster. Each participant may have its view of the state, so we have to look at every operation and define it in terms of its invocation and completion events and describe the operation bounds.

Let’s define a system in which processes can execute read(register) and write(register, value) operations on shared registers. Each process executes its own set of operations sequentially (i.e., every invoked operation has to complete before it can start the next one). The combination of sequential process executions forms a global history, in which operations can be executed concurrently.

The simplest way to think about consistency models is in terms of read and write operations and ways they can overlap: read operations have no side effects, while writes change the register state. This helps to reason about when exactly data becomes readable after the write. For example, consider a history in which two processes execute the following events concurrently:

Process 1:      Process 2:
write(x, 1)     read(x)
                read(x)

When looking at these events, it’s unclear what is an outcome of the read(x) operations in both cases. We have several possible histories:

  • Write completes before both reads.

  • The write and the two reads can get interleaved, with the write executed between the reads.

  • Both reads complete before the write.
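
Treating each operation as atomic for a moment, these three cases can be enumerated mechanically. The following is a minimal Python sketch (assuming an initial register value of 0, which stands in for the unwritten state) that merges the two per-process programs in every order-preserving way and collects the read outcomes:

```python
def interleavings(p1, p2):
    # all merges of two per-process programs that preserve program order
    if not p1:
        return [list(p2)]
    if not p2:
        return [list(p1)]
    return ([[p1[0]] + rest for rest in interleavings(p1[1:], p2)] +
            [[p2[0]] + rest for rest in interleavings(p1, p2[1:])])

def run(history):
    x, reads = 0, []                  # register x, assumed initial value 0
    for op, arg in history:
        if op == "write":
            x = arg
        else:
            reads.append(x)
    return tuple(reads)               # the values the two reads observed

p1 = [("write", 1)]                       # Process 1: write(x, 1)
p2 = [("read", None), ("read", None)]     # Process 2: read(x); read(x)

outcomes = {run(h) for h in interleavings(p1, p2)}
print(sorted(outcomes))               # [(0, 0), (0, 1), (1, 1)]
```

The three outcomes correspond exactly to the three histories listed above: write first, write between the reads, write last.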

There’s no simple answer to what should happen if we have just one copy of data. In a replicated system, we have more combinations of possible states, and it can get even more complicated when we have multiple processes reading and writing the data.

If all of these operations were executed by a single process, we could enforce a strict order of events, but it’s harder to do so with multiple processes. We can group the potential difficulties into two categories:

  • Operations may overlap.

  • Effects of the nonoverlapping calls might not be visible immediately.

To reason about the operation order and have nonambiguous descriptions of possible outcomes, we have to define consistency models. We discuss concurrency in distributed systems in terms of shared memory and concurrent systems, since most of the definitions and rules defining consistency still apply. Even though a lot of terminology overlaps between concurrent and distributed systems, we can’t directly apply most concurrent algorithms, because of differences in communication patterns, performance, and reliability.

Consistency Models

Since operations on shared memory registers are allowed to overlap, we should define clear semantics: what happens if multiple clients read or modify different copies of data simultaneously or within a short period. There’s no single right answer to that question, since these semantics are different depending on the application, but they are well studied in the context of consistency models.

Consistency models provide different semantics and guarantees. You can think of a consistency model as a contract between the participants: what each replica has to do to satisfy the required semantics, and what users can expect when issuing read and write operations.

Consistency models describe what expectations clients might have in terms of possible returned values despite the existence of multiple copies of data and concurrent accesses to it. In this section, we will discuss single-operation consistency models.

Each model describes how far the behavior of the system is from the behavior we might expect or find natural. It helps us to distinguish between “all possible histories” of interleaving operations and “histories permissible under model X,” which significantly simplifies reasoning about the visibility of state changes.

We can think about consistency from the perspective of state, describe which state invariants are acceptable, and establish allowable relationships between copies of the data placed onto different replicas. Alternatively, we can consider operation consistency, which provides an outside view on the data store, describes operations, and puts constraints on the order in which they occur [TANENBAUM06] [AGUILERA16].

Without a global clock, it is difficult to give distributed operations a precise and deterministic order. It’s like a Special Relativity Theory for data: every participant has its own perspective on state and time.

Theoretically, we could grab a system-wide lock every time we want to change the system state, but it’d be highly impractical. Instead, we use a set of rules, definitions, and restrictions that limit the number of possible histories and outcomes.

Consistency models add another dimension to what we discussed in “Infamous CAP”. Now we have to juggle not only consistency and availability, but also consider consistency in terms of synchronization costs [ATTIYA94]. Synchronization costs may include latency, additional CPU cycles spent executing additional operations, disk I/O used to persist recovery information, wait time, network I/O, and everything else that can be prevented by avoiding synchronization.

First, we’ll focus on visibility and propagation of operation results. Coming back to the example with concurrent reads and writes, we’ll be able to limit the number of possible histories by either positioning dependent writes after one another or defining a point at which the new value is propagated.

We discuss consistency models in terms of processes (clients) issuing read and write operations against the database state. Since we discuss consistency in the context of replicated data, we assume that the database can have multiple replicas.

Strict Consistency

Strict consistency is the equivalent of complete replication transparency: any write by any process is instantly available for the subsequent reads by any process. It involves the concept of a global clock and, if there was a write(x, 1) at instant t1, any read(x) will return a newly written value 1 at any instant t2 > t1.

Unfortunately, this is just a theoretical model, and it’s impossible to implement, as the laws of physics and the way distributed systems work set limits on how fast things may happen [SINHA97].

Linearizability

Linearizability is the strongest single-object, single-operation consistency model. Under this model, effects of the write become visible to all readers exactly once at some point in time between its start and end, and no client can observe state transitions or side effects of partial (i.e., unfinished, still in-flight) or incomplete (i.e., interrupted before completion) write operations [LEE15].

Concurrent operations are represented as one of the possible sequential histories for which visibility properties hold. There is some indeterminism in linearizability, as there may exist more than one way in which the events can be ordered [HERLIHY90].

If two operations overlap, they may take effect in any order. All read operations that occur after write operation completion can observe the effects of this operation. As soon as a single read operation returns a particular value, all reads that come after it return the value at least as recent as the one it returns [BAILIS14a].

There is some flexibility in terms of the order in which concurrent events occur in a global history, but they cannot be reordered arbitrarily. Operation results should not become effective before the operation starts as that would require an oracle able to predict future operations. At the same time, results have to take effect before completion, since otherwise, we cannot define a linearization point.

Linearizability respects both sequential process-local operation order and the order of operations running in parallel relative to other processes, and defines a total order of the events.

This order should be consistent, which means that every read of the shared value should return the latest value written to this shared variable preceding this read, or the value of a write that overlaps with this read. Linearizable write access to a shared variable also implies mutual exclusion: between the two concurrent writes, only one can go first.

Even though operations are concurrent and have some overlap, their effects become visible in a way that makes them appear sequential. No operation happens instantaneously, but still appears to be atomic.

Let’s consider the following history:

Process 1:      Process 2:     Process 3:
write(x, 1)     write(x, 2)    read(x)
                               read(x)
                               read(x)

图11-2中,我们有三个进程,其中两个对寄存器执行写操作x,寄存器的初始值为。读取操作可以通过以下方式之一观察这些写入:

In Figure 11-2, we have three processes, two of which perform write operations on the register x, which has an initial value of ∅. Read operations can observe these writes in one of the following ways:

  • a) The first read operation can return 1, 2, or ∅ (the initial value, a state before both writes), since both writes are still in-flight. The first read can get ordered before both writes, between the first and second writes, or after both writes.

  • b) The second read operation can return only 1 and 2, since the first write has completed, but the second write didn’t return yet.

  • c) The third read can only return 2, since the second write is ordered after the first.

Figure 11-2. Linearizability example
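
These outcomes can be checked by brute force. The sketch below models each operation as an interval on an abstract timeline (the interval endpoints are assumptions chosen to reproduce the overlap structure of Figure 11-2, with 0 standing in for the initial value ∅) and enumerates every linearization that respects real-time order:

```python
from itertools import permutations

# (kind, value, start, end); the times below are illustrative assumptions:
# Ra overlaps both writes, Rb runs after W1 but overlaps W2, Rc runs last.
ops = {
    "W1": ("write", 1, 0, 2),
    "W2": ("write", 2, 3, 7),
    "Ra": ("read", None, 1, 4),
    "Rb": ("read", None, 5, 6),
    "Rc": ("read", None, 8, 9),
}

def respects_real_time(order):
    # if operation B completes before operation A starts,
    # B must precede A in any valid linearization
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if ops[b][3] < ops[a][2]:   # b ends before a starts
                return False
    return True

possible = {"Ra": set(), "Rb": set(), "Rc": set()}
for order in permutations(ops):
    if not respects_real_time(order):
        continue
    x = 0                               # initial register value (∅)
    for name in order:
        kind, value, _, _ = ops[name]
        if kind == "write":
            x = value
        else:
            possible[name].add(x)

print(possible)   # Ra: {0, 1, 2}, Rb: {1, 2}, Rc: {2}
```

The result matches the case analysis above: the first read may observe any of the three states, the second only 1 or 2, and the third only 2.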

Linearization point

One of the most important traits of linearizability is visibility: once the operation is complete, everyone must see it, and the system can’t “travel back in time,” reverting it or making it invisible for some participants. In other words, linearization prohibits stale reads and requires reads to be monotonic.

This consistency model is best explained in terms of atomic (i.e., uninterruptible, indivisible) operations. Operations do not have to be instantaneous (also because there’s no such thing), but their effects have to become visible at some point in time, making an illusion that they were instantaneous. This moment is called a linearization point.

Past the linearization point of the write operation (in other words, when the value becomes visible for other processes) every process has to see either the value this operation wrote or some later value, if some additional write operations are ordered after it. A visible value should remain stable until the next one becomes visible after it, and the register should not alternate between the two recent states.

Note

Most programming languages these days offer atomic primitives that allow atomic write and compare-and-swap (CAS) operations. Unlike CAS, which moves from one value to the next only if the previous value is unchanged, an atomic write does not consider the current register value [HERLIHY94]. Reading the value, modifying it, and then writing it with CAS is more complex than simply checking and setting the value, because of the possible ABA problem [DECHEV10]: if CAS expects the value A to be present in the register, it will install the new value even if A was overwritten with B and then switched back to A by two other concurrent write operations. In other words, the presence of the value A alone does not guarantee that the value hasn’t been changed since the last read.
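
A minimal sketch of the ABA problem in Python, with a lock standing in for the hardware atomic instruction (the `CASRegister` class and its method names are illustrative, not a real library API):

```python
import threading

class CASRegister:
    """A register with atomic write and compare-and-swap (sketch)."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def read(self):
        with self._lock:
            return self._value

    def write(self, value):             # blind atomic write
        with self._lock:
            self._value = value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

r = CASRegister("A")
snapshot = r.read()        # this process observes A
r.write("B")               # a concurrent writer changes A -> B...
r.write("A")               # ...and another changes it back: B -> A
# The CAS succeeds even though the register changed twice in between:
print(r.compare_and_swap(snapshot, "C"))   # True -- the ABA problem
```

The CAS cannot distinguish "A was never touched" from "A was replaced and restored," which is exactly the ABA hazard described above.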

The linearization point serves as a cutoff, after which operation effects become visible. We can implement it by using locks to guard a critical section, atomic read/write, or read-modify-write primitives.

Figure 11-3 shows that linearizability assumes hard time bounds and the clock is real time, so the operation effects have to become visible between t1, when the operation request was issued, and t2, when the process received a response.

Figure 11-3. Time bounds of linearizable operations

Figure 11-4 illustrates that the linearization point cuts the history into before and after. Before the linearization point, the old value is visible, after it, the new value is visible.

Figure 11-4. Linearization point

Cost of linearizability

Many systems avoid implementing linearizability today. Even CPUs do not offer linearizability when accessing main memory by default. This has happened because synchronization instructions are expensive, slow, and involve cross-node CPU traffic and cache invalidations. However, it is possible to implement linearizability using low-level primitives [MCKENNEY05a], [MCKENNEY05b].

In concurrent programming, you can use compare-and-swap operations to introduce linearizability. Many algorithms work by preparing results and then using CAS for swapping pointers and publishing them. For example, we can implement a concurrent queue by creating a linked list node and then atomically appending it to the tail of the list [KHANCHANDANI18].
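
The prepare-then-publish pattern can be sketched with a lock-free stack push, which is slightly simpler than the queue mentioned above. The `AtomicRef` here is an assumed stand-in for a hardware CAS instruction, implemented with a lock for the sake of the example:

```python
import threading

class AtomicRef:
    """Minimal CAS cell (a lock stands in for the hardware instruction)."""
    def __init__(self, value=None):
        self._value, self._lock = value, threading.Lock()

    def read(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        with self._lock:
            if self._value is expected:   # identity check on the old node
                self._value = new
                return True
            return False

head = AtomicRef()          # top of a lock-free stack (None = empty)

def push(value):
    while True:                         # prepare the node, publish with CAS
        old = head.read()
        node = (value, old)             # new node points at the old top
        if head.compare_and_swap(old, node):
            return                      # CAS succeeded: node is published
        # otherwise another thread won the race; retry with the new top

threads = [threading.Thread(target=lambda i=i: [push((i, j)) for j in range(100)])
           for i in range(4)]
for t in threads: t.start()
for t in threads: t.join()

count, node = 0, head.read()
while node is not None:
    count, node = count + 1, node[1]
print(count)                            # 400 -- no pushes were lost
```

Each thread prepares its node privately and makes it visible in a single CAS; a failed CAS simply means another thread published first, so the loop retries against the new head.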

In distributed systems, linearizability requires coordination and ordering. It can be implemented using consensus: clients interact with a replicated store using messages, and the consensus module is responsible for ensuring that applied operations are consistent and identical across the cluster. Each write operation will appear instantaneously, exactly once at some point between its invocation and completion events [HOWARD14].

Interestingly, linearizability in its traditional understanding is regarded as a local property and implies composition of independently implemented and verified elements. Combining linearizable histories produces a history that is also linearizable [HERLIHY90]. In other words, a system in which all objects are linearizable is also linearizable. This is a very useful property, but we should remember that its scope is limited to a single object and, even though operations on two independent objects are linearizable, operations that involve both objects have to rely on additional synchronization means.

Sequential Consistency

Achieving linearizability might be too expensive, but it is possible to relax the model, while still providing rather strong consistency guarantees. Sequential consistency allows ordering operations as if they were executed in some sequential order, while requiring operations of each individual process to be executed in the same order they were performed by the process.

Processes can observe operations executed by other participants in the order consistent with their own history, but this view can be arbitrarily stale from the global perspective [KINGSBURY18a]. Order of execution between processes is undefined, as there’s no shared notion of time.

Sequential consistency was initially introduced in the context of concurrency, describing it as a way to execute multiprocessor programs correctly. The original description required memory requests to the same cell to be ordered in the queue (FIFO, arrival order), did not impose global ordering on the overlapping writes to independent memory cells, and allowed reads to fetch the value from the memory cell, or the latest value from the queue if the queue was nonempty [LAMPORT79]. This example helps to understand the semantics of sequential consistency. Operations can be ordered in different ways (depending on the arrival order, or even arbitrarily in case two writes arrive simultaneously), but all processes observe the operations in the same order.

Each process can issue read and write requests in an order specified by its own program, which is very intuitive. Any nonconcurrent, single-threaded program executes its steps this way: one after another. All write operations propagating from the same process appear in the order they were submitted by this process. Operations propagating from different sources may be ordered arbitrarily, but this order will be consistent from the readers’ perspective.

Note

Sequential consistency is often confused with linearizability since both have similar semantics. Sequential consistency, just as linearizability, requires operations to be globally ordered, but linearizability requires the local order of each process and global order to be consistent. In other words, linearizability respects a real-time operation order. Under sequential consistency, ordering holds only for the same-origin writes [VIOTTI16]. Another important distinction is composition: we can combine linearizable histories and still expect results to be linearizable, while sequentially consistent schedules are not composable [ATTIYA94].

Figure 11-5 shows how write(x,1) and write(x,2) can become visible to P3 and P4. Even though in wall-clock terms, 1 was written before 2, it can get ordered after 2. At the same time, while P3 already reads the value 1, P4 can still read 2. However, both orders, 1 → 2 and 2 → 1, are valid, as long as they’re consistent for different readers. What’s important here is that both P3 and P4 have observed values in the same order: first 2, and then 1 [TANENBAUM14].

Figure 11-5. Ordering in sequential consistency
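
The "same order for all readers" requirement can be expressed as a small check. This is a deliberately simplified sketch: it only compares the order in which each reader first observes every distinct value, which is enough to tell the two scenarios above apart:

```python
def sequentially_consistent_reads(observations):
    """Check that all readers observed the written values in the same
    relative order (whatever that order is). Simplified: only the order
    of first observation per value is compared."""
    def first_seen_order(seen):
        out = []
        for v in seen:
            if v not in out:
                out.append(v)
        return tuple(out)

    orders = {first_seen_order(seen) for seen in observations.values()}
    return len(orders) == 1

# P3 and P4 both observe 2 first, then 1 -- a valid order:
print(sequentially_consistent_reads({"P3": [2, 2, 1], "P4": [2, 1, 1]}))  # True
# The readers disagree on the order of the two writes -- not allowed:
print(sequentially_consistent_reads({"P3": [1, 2], "P4": [2, 1]}))        # False
```

Note that either global order, 1 → 2 or 2 → 1, passes the check on its own; only disagreement between readers is rejected, which mirrors the discussion of Figure 11-5.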

Stale reads can be explained, for example, by replica divergence: even though writes propagate to different replicas in the same order, they can arrive there at different times.

The main difference with linearizability is the absence of globally enforced time bounds. Under linearizability, an operation has to become effective within its wall-clock time bounds. By the time the write W₁ operation completes, its results have to be applied, and every reader should be able to see the value at least as recent as one written by W₁. Similarly, after a read operation R₁ returns, any read operation that happens after it should return the value that R₁ has seen or a later value (which, of course, has to follow the same rule).

Sequential consistency relaxes this requirement: an operation’s results can become visible after its completion, as long as the order is consistent from the individual processors’ perspective. Same-origin writes can’t “jump” over each other: their program order, relative to their own executing process, has to be preserved. The other restriction is that the order in which operations have appeared must be consistent for all readers.

Similar to linearizability, modern CPUs do not guarantee sequential consistency by default and, since the processor can reorder instructions, we should use memory barriers (also called fences) to make sure that writes become visible to concurrently running threads in order [DREPPER07] [GEORGOPOULOS16].

Causal Consistency

You see, there is only one constant, one universal, it is the only real truth: causality. Action. Reaction. Cause and effect.

Merovingian from The Matrix Reloaded

Even though having a global operation order is often unnecessary, it might be necessary to establish order between some operations. Under the causal consistency model, all processes have to see causally related operations in the same order. Concurrent writes with no causal relationship can be observed in a different order by different processors.

First, let’s take a look at why we need causality and how writes that have no causal relationship can propagate. In Figure 11-6, processes P1 and P2 make writes that aren’t causally ordered. The results of these operations can propagate to readers at different times and out of order. Process P3 will see the value 1 before it sees 2, while P4 will first see 2, and then 1.

Figure 11-6. Write operations without a causal relationship

Figure 11-7 shows an example of causally related writes. In addition to a written value, we now have to specify a logical clock value that establishes a causal order between operations. P1 starts with a write operation write(x,∅,1)→t1, which starts from the initial value ∅. P2 performs another write operation, write(x, t1, 2), and specifies that it is logically ordered after t1, requiring operations to propagate only in the order established by the logical clock.

Figure 11-7. Causally related write operations

This establishes a causal order between these operations. Even if the latter write propagates faster than the former one, it isn’t made visible until all of its dependencies arrive, and the event order is reconstructed from their logical timestamps. In other words, a happened-before relationship is established logically, without using physical clocks, and all processes agree on this order.

Figure 11-8 shows processes P1 and P2 making causally related writes, which propagate to P3 and P4 in their logical order. This prevents us from the situation shown in Figure 11-6; you can compare histories of P3 and P4 in both figures.

Figure 11-8. Write operations with a causal relationship

You can think of this in terms of communication on some online forum: you post something online, someone sees your post and responds to it, and a third person sees this response and continues the conversation thread. It is possible for conversation threads to diverge: you can choose to respond to one of the conversations in the thread and continue the chain of events, but some threads will have only a few messages in common, so there might be no single history for all the messages.

In a causally consistent system, we get session guarantees for the application, ensuring the view of the database is consistent with its own actions, even if it executes read and write requests against different, potentially inconsistent, servers [TERRY94]. These guarantees are: monotonic reads, monotonic writes, read-your-writes, writes-follow-reads. You can find more information on these session models in “Session Models”.

Causal consistency can be implemented using logical clocks [LAMPORT78] and sending context metadata with every message, summarizing which operations logically precede the current one. When the update is received from the server, it contains the latest version of the context. Any operation can be processed only if all operations preceding it have already been applied. Messages for which contexts do not match are buffered on the server as it is too early to deliver them.
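
This buffering scheme can be sketched as follows. Each message here carries a single explicit dependency for simplicity (real systems summarize the entire causal context), and delivery is deferred until that dependency has been applied:

```python
def causal_deliver(messages):
    """Deliver messages only after their dependency has been applied,
    buffering early arrivals (single-dependency sketch)."""
    applied, buffered, delivered = set(), [], []

    def try_deliver(msg):
        name, dep, value = msg
        if dep is None or dep in applied:
            applied.add(name)
            delivered.append(value)
            return True
        return False

    for msg in messages:
        if not try_deliver(msg):
            buffered.append(msg)        # too early: dependency missing
        # retry buffered messages until no more progress can be made
        progress = True
        while progress:
            progress = False
            for m in list(buffered):
                if try_deliver(m):
                    buffered.remove(m)
                    progress = True
    return delivered

# M2 depends on M1 but arrives first; it stays buffered until M1 lands.
print(causal_deliver([("M2", "M1", 2), ("M1", None, 1), ("M3", "M2", 3)]))
# -> [1, 2, 3]
```

Even though the messages arrive out of order, the delivered sequence respects the causal order established by the dependencies.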

The two prominent and frequently cited projects implementing causal consistency are Clusters of Order-Preserving Servers (COPS) [LLOYD11] and Eiger [LLOYD13]. Both projects implement causality through a library (implemented as a frontend server that users connect to) and track dependencies to ensure consistency. COPS tracks dependencies through key versions, while Eiger establishes operation order instead (operations in Eiger can depend on operations executed on the other nodes; for example, in the case of multipartition transactions). Neither project exposes out-of-order operations the way eventually consistent stores might. Instead, they detect and handle conflicts: in COPS, this is done by checking the key order and using application-specific functions, while Eiger implements the last-write-wins rule.

Vector clocks

Establishing causal order allows the system to reconstruct the sequence of events even if messages are delivered out of order, fill the gaps between the messages, and avoid publishing operation results in case some messages are still missing. For example, if messages {M1(∅, t1), M2(M1, t2), M3(M2, t3)}, each specifying their dependencies, are causally related and were propagated out of order, the process buffers them until it can collect all operation dependencies and restore their causal order [KINGSBURY18b]. Many databases, for example, Dynamo [DECANDIA07] and Riak [SHEEHY10a], use vector clocks [LAMPORT78] [MATTERN88] for establishing causal order.

A vector clock is a structure for establishing a partial order between the events, detecting and resolving divergence between the event chains. With vector clocks, we can simulate common time, global state, and represent asynchronous events as synchronous ones. Processes maintain vectors of logical clocks, with one clock per process. Every clock starts at the initial value and is incremented every time a new event arrives (for example, a write occurs). When receiving clock vectors from other processes, a process updates its local vector to the highest clock values per process from the received vectors (i.e., highest clock values the transmitting node has ever seen).
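
A minimal vector clock following these rules can be sketched in a few lines: increment your own entry on a local event, take the pairwise maximum on receive, and compare entrywise to detect happened-before versus concurrency (the class and method names are illustrative):

```python
class VectorClock:
    """Per-process vector of logical clocks (sketch)."""
    def __init__(self, clocks=None):
        self.clocks = dict(clocks or {})

    def tick(self, process):            # local event: increment own entry
        self.clocks[process] = self.clocks.get(process, 0) + 1

    def merge(self, other):             # on receive: pairwise maximum
        for p, t in other.clocks.items():
            self.clocks[p] = max(self.clocks.get(p, 0), t)

    def __le__(self, other):            # happened-before (or equal)
        return all(t <= other.clocks.get(p, 0)
                   for p, t in self.clocks.items())

    def concurrent(self, other):        # neither dominates: divergence
        return not (self <= other) and not (other <= self)

a, b = VectorClock(), VectorClock()
a.tick("P1")                 # P1's event:             {P1: 1}
b.merge(a); b.tick("P2")     # P2 saw P1's event:      {P1: 1, P2: 1}
print(a <= b)                # True  -- a happened before b
c = VectorClock(); c.tick("P3")
print(b.concurrent(c))       # True  -- divergent histories, a conflict
```

The `concurrent` case is precisely the divergence a vector clock can detect but not resolve, which is why resolution is left to the application, as discussed below.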

To use vector clocks for conflict resolution, whenever we make a write to the database, we first check if the value for the written key already exists locally. If the previous value already exists, we append a new version to the version vector and establish the causal relationship between the two writes. Otherwise, we start a new chain of events and initialize the value with a single version.

We were talking about consistency in terms of access to shared memory registers and wall-clock operation ordering, and first mentioned potential replica divergence when talking about sequential consistency. Since only write operations to the same memory location have to be ordered, we cannot end up in a situation where we have a write conflict if values are independent [LAMPORT79].

Since we’re looking for a consistency model that would improve availability and performance, we have to allow replicas to diverge not only by serving stale reads but also by accepting potentially conflicting writes, so the system is allowed to create two independent chains of events. Figure 11-9 shows such a divergence: from the perspective of one replica, we see history as 1, 5, 7, 8 and the other one reports 1, 5, 3. Riak allows users to see and resolve divergent histories [DAILY13].

Figure 11-9. Divergent histories under causal consistency

Note

To implement causal consistency, we have to store causal history, add garbage collection, and ask the user to reconcile divergent histories in case of a conflict. Vector clocks can tell you that the conflict has occurred, but do not propose exactly how to resolve it, since resolution semantics are often application-specific. Because of that, some eventually consistent databases, for example, Apache Cassandra, do not order operations causally and use the last-write-wins rule for conflict resolution instead [ELLIS13].

Session Models

Thinking about consistency in terms of value propagation is useful for database developers, since it helps to understand and impose required data invariants, but some things are easier understood and explained from the client point of view. We can look at our distributed system from the perspective of a single client instead of multiple clients.

Session models [VIOTTI16] (also called client-centric consistency models [TANENBAUM06]) help to reason about the state of the distributed system from the client perspective: how each client observes the state of the system while issuing read and write operations.

While the other consistency models we discussed so far focus on explaining operation ordering in the presence of concurrent clients, client-centric consistency focuses on how a single client interacts with the system. We still assume that each client’s operations are sequential: it has to finish one operation before it can start executing the next one. If the client crashes or loses connection to the server before its operation completes, we do not make any assumptions about the state of incomplete operations.

In a distributed system, clients often can connect to any available replica and, if the results of the recent write against one replica did not propagate to the other one, the client might not be able to observe the state change it has made.

One of the reasonable expectations is that every write issued by the client is visible to it. This assumption holds under the read-own-writes consistency model, which states that every read operation following the write on the same or the other replica has to observe the updated value. For example, read(x) that was executed immediately after write(x,V) will return the value V.

The monotonic reads model restricts the value visibility and states that if the read(x) has observed the value V, the following reads have to observe a value at least as recent as V or some later value.
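The guarantee is easy to state as a check over a session trace. In this hypothetical sketch, each read in a single client session observes some version number, and monotonic reads simply means the observed versions never go backward:

```python
def is_monotonic(read_versions):
    """Check the monotonic reads guarantee over a single session:
    every read observes a value at least as recent as the previous one."""
    return all(a <= b for a, b in zip(read_versions, read_versions[1:]))
```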

The monotonic writes model assumes that values originating from the same client appear in the order this client has executed them. If, according to the client session order, write(x,V2) was made after write(x,V1), their effects have to become visible in the same order (i.e., V1 first, and then V2) to all other processes. Without this assumption, old data can be “resurrected,” resulting in data loss.

Writes-follow-reads (sometimes referred to as session causality) ensures that writes are ordered after writes that were observed by previous read operations. For example, if write(x,V2) is ordered after read(x) that has returned V1, write(x,V2) will be ordered after write(x,V1).

Warning

Session models make no assumptions about operations made by different processes (clients) or from the different logical session [TANENBAUM14]. These models describe operation ordering from the point of view of a single process. However, the same guarantees have to hold for every process in the system. In other words, if P1 can read its own writes, P2 should be able to read its own writes, too.

Combining monotonic reads, monotonic writes, and read-own-writes gives Pipelined RAM (PRAM) consistency [LIPTON88] [BRZEZINSKI03], also known as FIFO consistency. PRAM guarantees that write operations originating from one process will propagate in the order they were executed by this process. Unlike under sequential consistency, writes from different processes can be observed in different order.

The properties described by client-centric consistency models are desirable and, in the majority of cases, are used by distributed systems developers to validate their systems and simplify their usage.

Eventual Consistency

Synchronization is expensive, both in multiprocessor programming and in distributed systems. As we discussed in “Consistency Models”, we can relax consistency guarantees and use models that allow some divergence between the nodes. For example, sequential consistency allows reads to be propagated at different speeds.

Under eventual consistency, updates propagate through the system asynchronously. Formally, it states that if there are no additional updates performed against the data item, eventually all accesses return the latest written value [VOGELS09]. In case of a conflict, the notion of latest value might change, as the values from diverged replicas are reconciled using a conflict resolution strategy, such as last-write-wins or using vector clocks (see “Vector clocks”).

Eventually is an interesting term to describe value propagation, since it specifies no hard time bound in which it has to happen. If the delivery service provides nothing more than an “eventually” guarantee, it doesn’t sound like it can be relied upon. However, in practice, this works well, and many databases these days are described as eventually consistent.

Tunable Consistency

Eventually consistent systems are sometimes described in CAP terms: you can trade availability for consistency or vice versa (see “Infamous CAP”). From the server-side perspective, eventually consistent systems usually implement tunable consistency, where data is replicated, read, and written using three variables:

Replication Factor N

Number of nodes that will store a copy of data.

Write Consistency W

Number of nodes that have to acknowledge a write for it to succeed.

Read Consistency R

Number of nodes that have to respond to a read operation for it to succeed.

Choosing consistency levels where (R + W > N), the system can guarantee returning the most recent written value, because there’s always an overlap between read and write sets. For example, if N = 3, W = 2, and R = 2, the system can tolerate a failure of just one node. Two nodes out of three must acknowledge the write. In the ideal scenario, the system also asynchronously replicates the write to the third node. If the third node is down, anti-entropy mechanisms (see Chapter 12) eventually propagate it.
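The overlap argument reduces to simple arithmetic, sketched below (function names are illustrative): reads are guaranteed to intersect the most recent write’s replica set whenever R + W > N, and a write can survive at most N - W replica failures.

```python
def is_strongly_consistent(n, w, r):
    """With replication factor n, write quorum w, and read quorum r,
    reads see the latest write iff read and write sets must overlap."""
    return r + w > n

def tolerated_write_failures(n, w):
    """Writes still succeed if at most n - w replicas are unavailable."""
    return n - w
```

For the example in the text, `is_strongly_consistent(3, 2, 2)` holds and `tolerated_write_failures(3, 2)` is 1: the system tolerates the failure of exactly one node.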

During the read, two replicas out of three have to be available to serve the request for us to respond with consistent results. Any combination of nodes will give us at least one node that will have the most up-to-date record for a given key.

Tip

When performing a write, the coordinator should submit it to N nodes, but can wait for only W nodes before it proceeds (or W - 1 in case the coordinator is also a replica). The rest of the write operations can complete asynchronously or fail. Similarly, when performing a read, the coordinator has to collect at least R responses. Some databases use speculative execution and submit extra read requests to reduce coordinator response latency. This means if one of the originally submitted read requests fails or arrives slowly, speculative requests can be counted toward R instead.

Write-heavy systems may sometimes pick W = 1 and R = N, which allows writes to be acknowledged by just one node before they succeed, but would require all the replicas (even potentially failed ones) to be available for reads. The same is true for the W = N, R = 1 combination: the latest value can be read from any node, as long as writes succeed only after being applied on all replicas.

Increasing read or write consistency levels increases latencies and raises requirements for node availability during requests. Decreasing them improves system availability while sacrificing consistency.

Witness Replicas

Using quorums for read consistency helps to improve availability: even if some of the nodes are down, a database system can still accept reads and serve writes. The majority requirement guarantees that, since there’s an overlap of at least one node in any majority, any quorum read will observe the most recent completed quorum write. However, using replication and majorities increases storage costs: we have to store a copy of the data on each replica. If our replication factor is five, we have to store five copies.

We can improve storage costs by using a concept called witness replicas. Instead of storing a copy of the record on each replica, we can split replicas into copy and witness subsets. Copy replicas still hold data records as previously. Under normal operation, witness replicas merely store the record indicating the fact that the write operation occurred. However, a situation might occur when the number of copy replicas is too low. For example, if we have three copy replicas and two witness ones, and two copy replicas go down, we end up with a quorum of one copy and two witness replicas.

In cases of write timeouts or copy replica failures, witness replicas can be upgraded to temporarily store the record in place of failed or timed-out copy replicas. As soon as the original copy replicas recover, upgraded replicas can revert to their previous state, or recovered replicas can become witnesses.

Let’s consider a replicated system with three nodes, two of which are holding copies of data and the third serves as a witness: [1c, 2c, 3w]. We attempt to make a write, but 2c is temporarily unavailable and cannot complete the operation. In this case, we temporarily store the record on the witness replica 3w. Whenever 2c comes back up, repair mechanisms can bring it back up-to-date and remove redundant copies from witnesses.

In a different scenario, we can attempt to perform a read, and the record is present on 1c and 3w, but not on 2c. Since any two replicas are enough to constitute a quorum, if any subset of nodes of size two is available, whether it’s two copy replicas [1c, 2c], or one copy replica and one witness [1c, 3w] or [2c, 3w], we can guarantee to serve consistent results. If we read from [1c, 2c], we fetch the latest record from 1c and can replicate it to 2c, since the value is missing there. In case only [2c, 3w] are available, the latest record can be fetched from 3w. To restore the original configuration and bring 2c up-to-date, the record can be replicated to it, and removed from the witness.

More generally, having n copy and m witness replicas has the same availability guarantees as n + m copies, given that we follow two rules:

  • Read and write operations are performed using majorities (i.e., with N/2 + 1 participants)

  • At least one of the replicas in this quorum is necessarily a copy one
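The two rules above can be expressed as a quorum-validity check. This sketch uses the naming convention from the example (`"1c"` for a copy replica, `"3w"` for a witness), which is illustrative rather than taken from any implementation:

```python
def quorum_ok(quorum, total_replicas):
    """A quorum is acceptable only if it is a majority AND contains
    at least one copy replica (names like '1c' = copy, '3w' = witness)."""
    is_majority = len(quorum) >= total_replicas // 2 + 1
    has_copy = any(r.endswith("c") for r in quorum)
    return is_majority and has_copy
```

With three replicas, `[1c, 3w]` and `[2c, 3w]` are valid quorums, while a single witness, or a majority made up of witnesses only, is not.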

This works because data is guaranteed to be either on the copy or witness replicas. Copy replicas are brought up-to-date by the repair mechanism in case of a failure, and witness replicas store the data in the interim.

Using witness replicas helps to reduce storage costs while preserving consistency invariants. There are several implementations of this approach; for example, Spanner [CORBETT12] and Apache Cassandra.

Strong Eventual Consistency and CRDTs

We’ve discussed several strong consistency models, such as linearizability and serializability, and a form of weak consistency: eventual consistency. A possible middle ground between the two, offering some benefits of both models, is strong eventual consistency. Under this model, updates are allowed to propagate to servers late or out of order, but when all updates finally propagate to target nodes, conflicts between them can be resolved and they can be merged to produce the same valid state [GOMES17].

Under some conditions, we can relax our consistency requirements by allowing operations to preserve additional state that allows the diverged states to be reconciled (in other words, merged) after execution. One of the most prominent examples of such an approach is Conflict-Free Replicated Data Types (CRDTs, [SHAPIRO11a]) implemented, for example, in Redis [BIYIKOGLU13].

CRDTs are specialized data structures that preclude the existence of conflict and allow operations on these data types to be applied in any order without changing the result. This property can be extremely useful in a distributed system. For example, in a multinode system that uses conflict-free replicated counters, we can increment counter values on each node independently, even if they cannot communicate with one another due to a network partition. As soon as communication is restored, results from all nodes can be reconciled, and none of the operations applied during the partition will be lost.

This makes CRDTs useful in eventually consistent systems, since replica states in such systems are allowed to temporarily diverge. Replicas can execute operations locally, without prior synchronization with other nodes, and operations eventually propagate to all other replicas, potentially out of order. CRDTs allow us to reconstruct the complete system state from local individual states or operation sequences.

The simplest example of CRDTs is operation-based Commutative Replicated Data Types (CmRDTs). For CmRDTs to work, we need the allowed operations to be:

Side-effect free

Their application does not change the system state.

Commutative

Argument order does not matter: x • y = y • x. In other words, it doesn’t matter whether x is merged with y, or y is merged with x.

Causally ordered

Their successful delivery depends on the precondition, which ensures that the system has reached the state the operation can be applied to.

For example, we could implement a grow-only counter. Each server can hold a state vector consisting of last known counter updates from all other participants, initialized with zeros. Each server is only allowed to modify its own value in the vector. When updates are propagated, the function merge(state1, state2) merges the states from the two servers.

For example, we have three servers, with initial state vectors initialized:

Node 1:          Node 2:          Node 3:
[0, 0, 0]        [0, 0, 0]        [0, 0, 0]

If we update counters on the first and third nodes, their states change as follows:

Node 1:          Node 2:          Node 3:
[1, 0, 0]        [0, 0, 0]        [0, 0, 1]

When updates propagate, we use a merge function to combine the results by picking the maximum value for each slot:

Node 1 (Node 3 state vector propagated):
merge([1, 0, 0], [0, 0, 1]) = [1, 0, 1]

Node 2 (Node 1 state vector propagated):
merge([0, 0, 0], [1, 0, 0]) = [1, 0, 0]

Node 2 (Node 3 state vector propagated):
merge([1, 0, 0], [0, 0, 1]) = [1, 0, 1]

Node 3 (Node 1 state vector propagated):
merge([0, 0, 1], [1, 0, 0]) = [1, 0, 1]

To determine the current vector state, the sum of values in all slots is computed: sum([1, 0, 1]) = 2. The merge function is commutative. Since servers are only allowed to update their own values and these values are independent, no additional coordination is required.
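The grow-only counter walked through above fits in a few functions. This is a sketch of the technique, not production code: each node increments only its own slot, merge takes the per-slot maximum, and the counter value is the sum of the slots.

```python
def increment(state, node_id):
    """A node records a local increment in its own slot only."""
    state = list(state)
    state[node_id] += 1
    return state

def merge(a, b):
    """Combine two state vectors by taking the per-slot maximum."""
    return [max(x, y) for x, y in zip(a, b)]

def value(state):
    """The counter value is the sum over all slots."""
    return sum(state)
```

Replaying the example: after incrementing on nodes 1 and 3, `merge([1, 0, 0], [0, 0, 1])` yields `[1, 0, 1]` in either argument order, and `value([1, 0, 1])` is 2.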

It is possible to produce a Positive-Negative-Counter (PN-Counter) that supports both increments and decrements by using payloads consisting of two vectors: P, which nodes use for increments, and N, where they store decrements. In a larger system, to avoid propagating huge vectors, we can use super-peers. Super-peers replicate counter states and help to avoid constant peer-to-peer chatter [SHAPIRO11b].

To save and replicate values, we can use registers. The simplest version of the register is the last-write-wins register (LWW register), which stores a unique, globally ordered timestamp attached to each value to resolve conflicts. In case of a conflicting write, we preserve only the one with the larger timestamp. The merge operation (picking the value with the largest timestamp) here is also commutative, since it relies on the timestamp. If we cannot allow values to be discarded, we can supply application-specific merge logic and use a multivalue register, which stores all values that were written and allows the application to pick the right one.
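A last-write-wins register reduces to a one-line merge, sketched here under the assumption the text makes — that timestamps are unique and globally ordered, so there are no ties to break:

```python
def lww_merge(a, b):
    """a and b are (timestamp, value) pairs; timestamps are assumed
    unique and globally ordered. Keep the entry written later."""
    return a if a[0] > b[0] else b
```

Because the outcome depends only on the timestamps, the merge is commutative: merging in either order keeps the same winner.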

Another example of CRDTs is an unordered grow-only set (G-Set). Each node maintains its local state and can append elements to it. Adding elements produces a valid set. Merging two sets is also a commutative operation. Similar to counters, we can use two sets to support both additions and removals. In this case, we have to preserve an invariant: only the values contained in the addition set can be added into the removal set. To reconstruct the current state of the set, all elements contained in the removal set are subtracted from the addition set [SHAPIRO11b].
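The add/remove variant described above (often called a two-phase set) can be sketched as two grow-only sets with the stated invariant — only values present in the addition set may enter the removal set:

```python
def add(adds, removes, x):
    """Appending to the addition set always produces a valid set."""
    return adds | {x}, removes

def remove(adds, removes, x):
    """Invariant: only values that were added may be removed."""
    if x not in adds:
        raise ValueError("cannot remove an element that was never added")
    return adds, removes | {x}

def merge(s1, s2):
    """Union both addition sets and both removal sets (commutative)."""
    return s1[0] | s2[0], s1[1] | s2[1]

def lookup(adds, removes):
    """The current state: additions minus removals."""
    return adds - removes
```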

An example of a conflict-free type that combines more complex structures is a conflict-free replicated JSON data type, allowing modifications such as insertions, deletions, and assignments on deeply nested JSON documents with list and map types. This algorithm performs merge operations on the client side and does not require operations to be propagated in any specific order [KLEPPMANN14].

There are quite a few possibilities CRDTs provide us with, and we can see more data stores using this concept to provide Strong Eventual Consistency (SEC). This is a powerful concept that we can add to our arsenal of tools for building fault-tolerant distributed systems.

Summary

Fault-tolerant systems use replication to improve availability: even if some processes fail or are unresponsive, the system as a whole can continue functioning correctly. However, keeping multiple copies in sync requires additional coordination.

We’ve discussed several single-operation consistency models, ordered from the one with the most guarantees to the one with the least:2

Linearizability

Operations appear to be applied instantaneously, and the real-time operation order is maintained.

Sequential consistency

Operation effects are propagated in some total order, and this order is consistent with the order they were executed by the individual processes.

Causal consistency

Effects of the causally related operations are visible in the same order to all processes.

PRAM/FIFO consistency

Operation effects become visible in the same order they were executed by individual processes. Writes from different processes can be observed in different orders.

After that, we discussed multiple session models:

Read-own-writes

Read operations reflect the previous writes. Writes propagate through the system and become available for later reads that come from the same client.

Monotonic reads

Any read that has observed a value cannot observe a value that is older than the observed one.

Monotonic writes

Writes coming from the same client propagate to other clients in the order they were made by this client.

Writes-follow-reads

Write operations are ordered after the writes whose effects were observed by the previous reads executed by the same client.

Knowing and understanding these concepts can help you to understand the guarantees of the underlying systems and use them for application development. Consistency models describe rules that operations on data have to follow, but their scope is limited to a specific system. Stacking systems with weaker guarantees on top of ones with stronger guarantees or ignoring consistency implications of underlying systems may lead to unrecoverable inconsistencies and data loss.

We also discussed the concept of eventual and tunable consistency. Quorum-based systems use majorities to serve consistent data. Witness replicas can be used to reduce storage costs.

1 Quorum reads and writes in the context of eventually consistent stores, which are discussed in more detail in “Eventual Consistency”.

2 These short definitions are given for recap only; the reader is advised to refer to the complete definitions for context.

Chapter 12. Anti-Entropy and Dissemination

Most of the communication patterns we’ve been discussing so far were either peer-to-peer or one-to-many (coordinator and replicas). To reliably propagate data records throughout the system, we need the propagating node to be available and able to reach the other nodes, but even then the throughput is limited to a single machine.

Quick and reliable propagation may be less applicable to data records and more important for the cluster-wide metadata, such as membership information (joining and leaving nodes), node states, failures, schema changes, etc. Messages containing this information are generally infrequent and small, but have to be propagated as quickly and reliably as possible.

Such updates can generally be propagated to all nodes in the cluster using one of the three broad groups of approaches [DEMERS87]; schematic depictions of these communication patterns are shown in Figure 12-1:

  • a) Notification broadcast from one process to all others.

  • b) Periodic peer-to-peer information exchange. Peers connect pairwise and exchange messages.

  • c) Cooperative broadcast, where message recipients become broadcasters and help to spread the information quicker and more reliably.

Figure 12-1. Broadcast (a), anti-entropy (b), and gossip (c)

Broadcasting the message to all other processes is the most straightforward approach that works well when the number of nodes in the cluster is small, but in large clusters it can get expensive because of the number of nodes, and unreliable because of overdependence on a single process. Individual processes may not always know about the existence of all other processes in the network. Moreover, there has to be some overlap in time during which both the broadcasting process and each one of its recipients are up, which might be difficult to achieve in some cases.

To relax these constraints, we can assume that some updates may fail to propagate. The coordinator will do its best and deliver the messages to all available participants, and then anti-entropy mechanisms will bring nodes back in sync in case there were any failures. This way, the responsibility for delivering messages is shared by all nodes in the system, and is split into two steps: primary delivery and periodic sync.

是表示系统无序程度的属性。在分布式系统中,熵表示节点之间状态分歧的程度。由于这种性质是不受欢迎的,并且其数量应保持在最低限度,因此有许多技术可以帮助处理熵。

Entropy is a property that represents the measure of disorder in the system. In a distributed system, entropy represents a degree of state divergence between the nodes. Since this property is undesired and its amount should be kept to a minimum, there are many techniques that help to deal with entropy.

Anti-entropy is usually used to bring the nodes back up-to-date in case the primary delivery mechanism has failed. The system can continue functioning correctly even if the coordinator fails at some point, since the other nodes will continue spreading the information. In other words, anti-entropy is used to lower the convergence time bounds in eventually consistent systems.

To keep nodes in sync, anti-entropy triggers a background or a foreground process that compares and reconciles missing or conflicting records. Background anti-entropy processes use auxiliary structures such as Merkle trees and update logs to identify divergence. Foreground anti-entropy processes piggyback read or write requests: hinted handoff, read repairs, etc.

If replicas diverge in a replicated system, to restore consistency and bring them back in sync, we have to find and repair missing records by comparing replica states pairwise. For large datasets, this can be very costly: we have to read the whole dataset on both nodes and notify replicas about more recent state changes that weren’t yet propagated. To reduce this cost, we can consider ways in which replicas can get out-of-date and patterns in which data is accessed.

读修复

Read Repair

在读取期间最容易检测到副本之间的分歧,因为此时我们可以联系副本,向每个副本请求所查询的状态,并查看它们的响应是否匹配。请注意,在这种情况下,我们不会查询每个副本上存储的整个数据集,而是将目标限制为客户端所请求的数据。

It is easiest to detect divergence between the replicas during the read, since at that point we can contact replicas, request the queried state from each one of them, and see whether or not their responses match. Note that in this case we do not query an entire dataset stored on each replica, and we limit our goal to just the data that was requested by the client.

协调器执行分布式读取,乐观地假设副本是同步的并且具有相同的可用信息。如果副本发送不同的响应,协调器会将丢失的更新发送到丢失的副本。

The coordinator performs a distributed read, optimistically assuming that replicas are in sync and have the same information available. If replicas send different responses, the coordinator sends missing updates to the replicas where they’re missing.

这种机制称为读修复。它通常用于检测和消除不一致之处。在读修复期间,协调器节点向副本发出请求,等待它们的响应,并对它们进行比较。如果某些副本错过了最近的更新并且它们的响应不同,协调器会检测到不一致并将更新发送回副本[DECANDIA07]

This mechanism is called read repair. It is often used to detect and eliminate inconsistencies. During read repair, the coordinator node makes a request to replicas, waits for their responses, and compares them. In case some of the replicas have missed the recent updates and their responses differ, the coordinator detects inconsistencies and sends updates back to the replicas [DECANDIA07].
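As a rough illustration of the mechanism described above, here is a Python sketch of a coordinator performing read repair. The `Replica` class, the `(value, timestamp)` representation, and last-write-wins reconciliation are simplifying assumptions for illustration, not any particular database's implementation.

```python
# Hypothetical sketch of read repair. Values are stored as
# (value, timestamp) pairs and reconciled with last-write-wins;
# both choices are illustrative assumptions.

class Replica:
    def __init__(self, name, data):
        self.name = name
        self.data = data  # key -> (value, timestamp)

    def get(self, key):
        return self.data.get(key, (None, 0))

    def put(self, key, versioned_value):
        self.data[key] = versioned_value

def read_repair(replicas, key):
    # Request the queried state from every replica.
    responses = {r.name: r.get(key) for r in replicas}
    # Merge: pick the version with the highest timestamp.
    latest = max(responses.values(), key=lambda v: v[1])
    # Send the missing update back to any replica that returned a stale value.
    for r in replicas:
        if responses[r.name][1] < latest[1]:
            r.put(key, latest)
    return latest
```

A blocking variant would acknowledge the read only after the repair writes are confirmed; an asynchronous variant would schedule them after returning the merged result.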

一些 Dynamo 风格的数据库选择取消联系所有副本的要求,转而使用可调一致性级别。为了返回一致的结果,我们不必联系并修复所有副本,而只需联系满足一致性级别的节点数量。如果我们进行仲裁读取和写入,仍然可以得到一致的结果,但某些副本仍然可能不包含所有写入。

Some Dynamo-style databases choose to lift the requirement of contacting all replicas and use tunable consistency levels instead. To return consistent results, we do not have to contact and repair all the replicas, but only the number of nodes that satisfies the consistency level. If we do quorum reads and writes, we still get consistent results, but some of the replicas still might not contain all the writes.

读修复可以作为阻塞或异步操作来实现。在阻塞读修复期间,原始客户端请求必须等待协调器“修复”副本。异步读修复只是调度一个可以在结果返回给用户之后执行的任务。

Read repair can be implemented as a blocking or asynchronous operation. During blocking read repair, the original client request has to wait until the coordinator “repairs” the replicas. Asynchronous read repair simply schedules a task that can be executed after results are returned to the user.

阻塞读修复可确保仲裁读取的读单调性(请参阅“会话模型”):一旦客户端读取了某个特定值,后续读取将返回至少与它所看到的值一样新的值,因为副本状态已被修复。如果我们不使用仲裁进行读取,就会失去这种单调性保证,因为在后续读取时数据可能尚未传播到目标节点。同时,阻塞读修复会牺牲可用性,因为修复必须由目标副本确认,并且在它们响应之前读取无法返回。

Blocking read repair ensures read monotonicity (see “Session Models”) for quorum reads: as soon as the client reads a specific value, subsequent reads return the value at least as recent as the one it has seen, since replica states were repaired. If we’re not using quorums for reads, we lose this monotonicity guarantee as data might have not been propagated to the target node by the time of a subsequent read. At the same time, blocking read repair sacrifices availability, since repairs should be acknowledged by the target replicas and the read cannot return until they respond.

为了准确检测副本响应之间哪些记录不同,某些数据库(例如 Apache Cassandra)使用带有合并侦听器的专用迭代器,该迭代器会重建合并结果与各个输入之间的差异。然后协调器使用其输出来通知副本有关丢失的数据。

To detect exactly which records differ between replica responses, some databases (for example, Apache Cassandra) use specialized iterators with merge listeners, which reconstruct differences between the merged result and individual inputs. Its output is then used by the coordinator to notify replicas about the missing data.

读修复假设副本大部分是同步的,我们不希望每个请求都会回退到阻塞修复。由于阻塞修复的读取单调性,我们还可以期望后续请求返回相同的一致结果,只要在此期间没有完成写入操作。

Read repair assumes that replicas are mostly in sync and we do not expect every request to fall back to a blocking repair. Because of the read monotonicity of blocking repairs, we can also expect subsequent requests to return the same consistent results, as long as there was no write operation that has completed in the interim.

摘要读取

Digest Reads

协调器可以不向每个节点都发出完整读取请求,而是只发出一个完整读取请求,并只向其他副本发送摘要请求。摘要请求读取副本本地数据,但不返回所请求数据的完整快照,而是计算该响应的哈希值。现在,协调器可以计算完整读取的哈希值,并将其与所有其他节点的摘要进行比较。如果所有摘要都匹配,就可以确信副本是同步的。

Instead of issuing a full read request to each node, the coordinator can issue only one full read request and send only digest requests to the other replicas. A digest request reads the replica-local data and, instead of returning a full snapshot of the requested data, it computes a hash of this response. Now, the coordinator can compute a hash of the full read and compare it to digests from all other nodes. If all the digests match, it can be confident that the replicas are in sync.

如果摘要不匹配,协调器不知道哪些副本在前面,哪些副本在后面。为了使滞后副本重新与其余节点同步,协调器必须向任何响应不同摘要的副本发出完整读取,比较它们的响应,协调数据,并将更新发送到滞后副本。

In case digests do not match, the coordinator does not know which replicas are ahead, and which ones are behind. To bring lagging replicas back in sync with the rest of the nodes, the coordinator has to issue full reads to any replicas that responded with different digests, compare their responses, reconcile the data, and send updates to the lagging replicas.
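The digest-read flow above can be sketched in a few lines of Python. MD5 follows the example given in the text below; representing replicas as plain dicts is an illustrative assumption.

```python
import hashlib

# Sketch of a digest read: one full read plus hash-only requests to the
# remaining replicas. Replicas are modeled as plain dicts for brevity.

def digest(value):
    # Hash of a replica's response, in place of the full snapshot.
    return hashlib.md5(repr(value).encode()).hexdigest()

def digest_read(replicas, key):
    full = replicas[0].get(key)        # the single full read
    expected = digest(full)
    # Replicas whose digest differs need a follow-up full read and repair.
    stale = [i for i, r in enumerate(replicas[1:], start=1)
             if digest(r.get(key)) != expected]
    return full, stale
```

If `stale` is empty, the replicas are believed in sync and the full response can be returned immediately; otherwise the coordinator issues full reads to the mismatching replicas and reconciles.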

注意

摘要通常使用非加密哈希函数(例如 MD5)来计算,因为必须快速计算它才能使“快乐路径”具有高性能。哈希函数可能会发生冲突,但对于大多数现实世界的系统来说,其概率可以忽略不计。由于数据库通常使用不止一种反熵机制,因此我们可以预期,即使在不太可能发生哈希冲突的情况下,数据也会由其他子系统进行协调。

Digests are usually computed using a noncryptographic hash function, such as MD5, since it has to be computed quickly to make the “happy path” performant. Hash functions can have collisions, but their probability is negligible for most real-world systems. Since databases often use more than just one anti-entropy mechanism, we can expect that, even in the unlikely event of a hash collision, data will be reconciled by the different subsystem.

提示切换

Hinted Handoff

另一种反熵方法称为提示切换(hinted handoff)[DECANDIA07],这是一种写端修复机制。如果目标节点未能确认写入,写协调器或其中一个副本会存储一条称为提示(hint)的特殊记录,一旦目标节点恢复,就会将其重放到目标节点。

Another anti-entropy approach is called hinted handoff [DECANDIA07], a write-side repair mechanism. If the target node fails to acknowledge the write, the write coordinator or one of the replicas stores a special record, called a hint, which is replayed to the target node as soon as it comes back up.

在 Apache Cassandra 中,除非使用 ANY 一致性级别 [ELLIS11],否则提示写入不会计入复制因子(请参阅“可调一致性”),因为提示日志中的数据无法用于读取,仅用于帮助落后的参与者赶上进度。

In Apache Cassandra, unless the ANY consistency level is in use [ELLIS11], hinted writes aren’t counted toward the replication factor (see “Tunable Consistency”), since the data in the hint log isn’t accessible for reads and is only used to help the lagging participants catch up.

一些数据库(例如 Riak)将草率仲裁(sloppy quorum)与提示切换结合使用。在草率仲裁下,如果副本发生故障,写操作可以使用节点列表中其他健康的节点,并且这些节点不必是所执行操作的目标副本。

Some databases, for example Riak, use sloppy quorums together with hinted handoff. With sloppy quorums, in case of replica failures, write operations can use additional healthy nodes from the node list, and these nodes do not have to be target replicas for the executed operations.

例如,假设我们有一个包含节点 {A, B, C, D, E} 的五节点集群,其中 {A, B, C} 是所执行写操作的副本,并且节点 B 已关闭。A 作为查询的协调者,选择节点 D 来满足草率仲裁,并维持所需的可用性和持久性保证。现在,数据被复制到 {A, D, C}。然而,D 上的记录会在其元数据中带有一个提示,因为该写入最初是针对 B 的。一旦 B 恢复,D 将尝试把提示转发给它。提示在 B 上重放之后,就可以安全地将其删除,而不会减少副本总数 [DECANDIA07]。

For example, say we have a five-node cluster with nodes {A, B, C, D, E}, where {A, B, C} are replicas for the executed write operation, and node B is down. A, being the coordinator for the query, picks node D to satisfy the sloppy quorum and maintain the desired availability and durability guarantees. Now, data is replicated to {A, D, C}. However, the record at D will have a hint in its metadata, since the write was originally intended for B. As soon as B recovers, D will attempt to forward a hint back to it. Once the hint is replayed on B, it can be safely removed without reducing the total number of replicas [DECANDIA07].
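The example above can be sketched in Python. The `Node` class and its hint log are hypothetical constructs for illustration; real systems store hints durably and replay them via the failure detector, which is omitted here.

```python
# Sketch of hinted handoff under a sloppy quorum, following the example
# in the text: replicas {A, B, C}, node B down, D standing in for B.

class Node:
    def __init__(self, name):
        self.name = name
        self.data = {}
        self.hints = []   # (intended_target, key, value)
        self.up = True

    def write(self, key, value, intended_for=None):
        self.data[key] = value
        if intended_for is not None:
            # Remember that this write really belongs on another node.
            self.hints.append((intended_for, key, value))

    def replay_hints(self, nodes):
        # Forward stored hints to their targets once they are reachable.
        remaining = []
        for target, key, value in self.hints:
            if nodes[target].up:
                nodes[target].write(key, value)
            else:
                remaining.append((target, key, value))
        self.hints = remaining
```

Once a hint is successfully replayed, it can be dropped, since the intended replica now holds the write.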

在类似的情况下,如果节点 {B, C} 因网络分区而与集群的其余部分短暂隔离,并且针对 {A, D, E} 执行了一次草率仲裁写入,那么紧随此写入之后在 {B, C} 上进行的读取将不会观察到最新的写入 [DOWNEY12]。换句话说,草率仲裁以一致性为代价提高了可用性。

Under similar circumstances, if nodes {B, C} are briefly separated from the rest of the cluster by the network partition, and a sloppy quorum write was done against {A, D, E}, a read on {B, C}, immediately following this write, would not observe the latest read [DOWNEY12]. In other words, sloppy quorums improve availability at the cost of consistency.

默克尔树

Merkle Trees

由于读修复只能修复当前所查询数据上的不一致,我们应该使用不同的机制来查找和修复未被主动查询的数据中的不一致。

Since read repair can only fix inconsistencies on the currently queried data, we should use different mechanisms to find and repair inconsistencies in the data that is not actively queried.

正如我们已经讨论过的,准确查找副本之间哪些行存在分歧需要成对交换和比较数据记录。这是非常不切实际且昂贵的。许多数据库采用Merkle 树 [MERKLE87]来降低协调成本。

As we already discussed, finding exactly which rows have diverged between the replicas requires exchanging and comparing the data records pairwise. This is highly impractical and expensive. Many databases employ Merkle trees [MERKLE87] to reduce the cost of reconciliation.

Merkle 树组成本地数据的紧凑哈希表示,构建哈希树。该哈希树的最低层是通过扫描保存数据记录的整个表并计算记录范围的哈希值来构建的。较高的树级别包含较低级别的散列的散列,构建层次表示,使我们能够通过比较散列来快速检测不一致性,递归地跟踪散列树节点以缩小不一致的范围。这可以通过逐级交换和比较子树,或者通过交换和比较整个树来完成。

Merkle trees compose a compact hashed representation of the local data, building a tree of hashes. The lowest level of this hash tree is built by scanning an entire table holding data records, and computing hashes of record ranges. Higher tree levels contain hashes of the lower-level hashes, building a hierarchical representation that allows us to quickly detect inconsistencies by comparing the hashes, following the hash tree nodes recursively to narrow down inconsistent ranges. This can be done by exchanging and comparing subtrees level-wise, or by exchanging and comparing entire trees.

图 12-2显示了 Merkle 树的组成。最低级别由数据记录范围的散列组成。每个较高级别的哈希值是通过对底层哈希值进行哈希处理来计算的,递归地重复此过程直至树根。

Figure 12-2 shows a composition of a Merkle tree. The lowest level consists of the hashes of data record ranges. Hashes for each higher level are computed by hashing underlying level hashes, repeating this process recursively up to the tree root.

图 12-2。默克尔树。灰色框代表数据记录范围。白色框代表哈希树层次结构。

为了确定两个副本之间是否存在不一致,我们只需要比较它们的 Merkle 树的根级哈希值。通过从上到下成对比较哈希值,可以定位节点之间存在差异的范围,并修复其中包含的数据记录。

To determine whether or not there’s an inconsistency between the two replicas, we only need to compare the root-level hashes from their Merkle trees. By comparing hashes pairwise from top to bottom, it is possible to locate ranges holding differences between the nodes, and repair data records contained in them.

由于Merkle树是从下到上递归计算的,所以数据的变化会触发整个子树的重新计算。树的大小(因此,交换消息的大小)与其精度(数据范围有多小和精确)之间也存在权衡。

Since Merkle trees are calculated recursively from the bottom to the top, a change in data triggers recomputation of the entire subtree. There’s also a trade-off between the size of a tree (consequently, sizes of exchanged messages) and its precision (how small and exact data ranges are).
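A minimal Merkle-tree sketch in Python follows. SHA-1, the power-of-two leaf count, and comparing the whole leaf level after a root mismatch (instead of descending the tree level by level) are simplifications for illustration.

```python
import hashlib

# Minimal Merkle-tree sketch: leaves hash serialized record ranges;
# each parent hashes the concatenation of its two children.
# Assumes the number of ranges is a power of two, for simplicity.

def h(data):
    return hashlib.sha1(data.encode()).hexdigest()

def build_tree(ranges):
    level = [h(r) for r in ranges]          # leaf level
    tree = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        tree.append(level)
    return tree                              # tree[-1][0] is the root

def diff_leaves(a, b):
    # Matching roots mean the replicas are in sync. On a mismatch, a real
    # implementation descends the tree; here we compare leaves directly.
    if a[-1][0] == b[-1][0]:
        return []
    return [i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y]
```

Only the ranges whose leaf hashes differ need to be exchanged and repaired; everything else is known to be identical from the hashes alone.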

位图版本向量

Bitmap Version Vectors

关于该主题的较新研究引入了位图版本向量 [GONÇALVES15],它可用于基于新近度解决数据冲突:每个节点为每个对等点保留一份本地发生或被复制的操作日志。在反熵过程中,会比较日志,并将缺失的数据复制到目标节点。

More recent research on this subject introduces bitmap version vectors [GONÇALVES15], which can be used to resolve data conflicts based on recency: each node keeps a per-peer log of operations that have occurred locally or were replicated. During anti-entropy, logs are compared, and missing data is replicated to the target node.

由节点协调的每个写入都用一个点(dot)(i, n) 来表示:由节点 n 协调、具有节点本地序列号 i 的事件。序列号 i 从 1 开始,并在节点每次执行写操作时递增。

Each write, coordinated by a node, is represented by a dot (i,n): an event with a node-local sequence number i coordinated by the node n. The sequence number i starts with 1 and is incremented each time the node executes a write operation.

为了跟踪副本状态,我们使用节点本地逻辑时钟。每个时钟代表一组点,表示该节点直接(由节点本身协调)或传递地(由其他节点协调并从其他节点复制)看到的写入。

To track replica states, we use node-local logical clocks. Each clock represents a set of dots, representing writes this node has seen directly (coordinated by the node itself), or transitively (coordinated by and replicated from the other nodes).

在节点逻辑时钟中,由节点本身协调的事件不会有间隙。如果某些写入尚未从其他节点复制过来,时钟就会包含间隙。为了使两个节点恢复同步,它们可以交换逻辑时钟,识别由缺失点表示的间隙,然后复制与这些点关联的数据记录。为此,我们需要重建每个点所引用的数据记录。该信息存储在点式因果容器(dotted causal container,DCC)中,它将点映射到给定键的因果信息。这样,冲突解决就可以捕获写入之间的因果关系。

In the node logical clock, events coordinated by the node itself will have no gaps. If some writes aren’t replicated from the other nodes, the clock will contain gaps. To get two nodes back in sync, they can exchange logical clocks, identify gaps represented by the missing dots, and then replicate data records associated with them. To do this, we need to reconstruct the data records each dot refers to. This information is stored in a dotted causal container (DCC), which maps dots to causal information for a given key. This way, conflict resolution captures causal relationships between the writes.

图 12-3(改编自 [GONÇALVES15])显示了系统中三个节点 P1、P2 和 P3 的状态表示示例,从 P2 的角度跟踪它已经看到了哪些值。每次 P2 执行写入或接收到复制值时,都会更新此表。

Figure 12-3 (adapted from [GONÇALVES15]) shows an example of the state representation of three nodes in the system, P1, P2 and P3, from the perspective of P2, tracking which values it has seen. Each time P2 makes a write or receives a replicated value, it updates this table.

图 12-3。位图版本矢量示例

在复制期间,P2 创建此状态的紧凑表示:一个从节点标识符映射到一对值的映射,其中包含它已看到连续写入的最新序号,以及一个位图,其中其他已看到的写入被编码为 1。这里的 (3, 01101₂) 表示节点 P2 已看到直到第三个值的连续更新,并且看到了相对于 3 的第二、第三和第五个位置上的值(即,它已看到序列号为 5、6 和 8 的值)。

During replication, P2 creates a compact representation of this state and creates a map from the node identifier to a pair of latest values, up to which it has seen consecutive writes, and a bitmap where other seen writes are encoded as 1. (3, 011012) here means that node P2 has seen consecutive updates up to the third value, and it has seen values on the second, third, and fifth position relative to 3 (i.e., it has seen the values with sequence numbers 5, 6, and 8).

在与其他节点交换期间,它将接收对方已看到而自己缺失的更新。一旦系统中的所有节点都看到了直到索引 i 的连续值,版本向量就可以截断到该索引。

During exchange with other nodes, it will receive the missing updates the other node has seen. As soon as all the nodes in the system have seen consecutive values up to the index i, the version vector can be truncated up to this index.
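The compact (base, bitmap) representation described above can be sketched as follows. The encoding choice (bit `s - base - 1` marks sequence number `s`) and the function names are illustrative assumptions; the sketch reproduces the example from the text, where consecutive writes up to 3 plus writes 5, 6, and 8 compress to base 3 with bits set at relative positions 2, 3, and 5.

```python
# Sketch of the per-peer (base, bitmap) summary used by bitmap version
# vectors: `base` is the highest sequence number up to which all writes
# were seen consecutively; later seen writes set bits relative to base.

def compact(seen):
    # seen: set of sequence numbers observed from one peer
    base = 0
    while base + 1 in seen:
        base += 1
    bitmap = 0
    for s in seen:
        if s > base:
            bitmap |= 1 << (s - base - 1)
    return base, bitmap

def missing(base, bitmap, upto):
    # Sequence numbers <= upto that this node has not yet seen:
    # exactly the gaps that anti-entropy needs to replicate.
    return [s for s in range(base + 1, upto + 1)
            if not bitmap & (1 << (s - base - 1))]
```

During anti-entropy, two nodes exchange these summaries, compute the gaps with `missing`, and replicate only the records those sequence numbers refer to.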

这种方法的优点是它捕获值写入之间的因果关系,并允许节点精确识别其他节点上丢失的数据点。一个可能的缺点是,如果节点长时间关闭,对等节点无法截断日志,因为一旦落后节点恢复,数据仍然必须复制到滞后节点。

An advantage of this approach is that it captures the causal relation between the value writes and allows nodes to precisely identify the data points missing on the other nodes. A possible downside is that, if the node was down for an extended time period, peer nodes can’t truncate the log, since data still has to be replicated to the lagging node once it comes back up.

八卦传播

Gossip Dissemination

群众始终是精神流行病的滋生地。

卡尔·荣格

Masses are always breeding grounds of psychic epidemics.

Carl Jung

为了让其他节点参与进来,并以广播的覆盖范围和反熵的可靠性来传播更新,我们可以使用八卦(gossip)协议。

To involve other nodes, and propagate updates with the reach of a broadcast and the reliability of anti-entropy, we can use gossip protocols.

八卦协议基于谣言在人类社会中如何传播或疾病如何在人群中传播的概率通信程序。谣言和流行病提供了相当说明性的方式来描述这些协议如何运作:谣言传播,而人们仍然有兴趣听到它们;疾病会不断传播,直到人群中不再有易感成员为止。

Gossip protocols are probabilistic communication procedures based on how rumors are spread in human society or how diseases propagate in the population. Rumors and epidemics provide rather illustrative ways to describe how these protocols work: rumors spread while the population still has an interest in hearing them; diseases propagate until there are no more susceptible members in the population.

八卦协议的主要目标是使用协作传播将信息从一个进程传播到集群的其余部分。正如病毒通过从一个人传播到另一个人而在人群中传播一样,每一步传播范围都可能扩大,信息通过系统传递,涉及更多流程。

The main objective of gossip protocols is to use cooperative propagation to disseminate information from one process to the rest of the cluster. Just as a virus spreads through the human population by being passed from one individual to another, potentially increasing in scope with each step, information is relayed through the system, getting more processes involved.

持有一条必须传播的记录的进程被称为感染性的(infective)。任何尚未收到更新的进程则是易感的(susceptible)。经过一段时间的主动传播后,不再愿意传播新状态的感染性进程被称为已移除的(removed)[DEMERS87]。所有进程都从易感状态开始。每当某条数据记录的更新到达时,接收到它的进程就会进入感染状态,并开始将更新传播给其他随机选择的相邻进程,从而感染它们。一旦感染性进程确定更新已被传播,它们就会转入已移除状态。

A process that holds a record that has to be spread around is said to be infective. Any process that hasn’t received the update yet is then susceptible. Infective processes not willing to propagate the new state after a period of active dissemination are said to be removed [DEMERS87]. All processes start in a susceptible state. Whenever an update for some data record arrives, a process that received it moves to the infective state and starts disseminating the update to other random neighboring processes, infecting them. As soon as the infective processes become certain that the update was propagated, they move to the removed state.

为了避免显式协调、维护全局接收者列表以及需要单个协调器向系统中的每个其他参与者广播消息,这类算法使用兴趣损失函数来对完整性进行建模。协议的效率则取决于它能多快地感染尽可能多的节点,同时将冗余消息造成的开销降至最低。

To avoid explicit coordination and maintaining a global list of recipients and requiring a single coordinator to broadcast messages to each other participant in the system, this class of algorithms models completeness using the loss of interest function. The protocol efficiency is then determined by how quickly it can infect as many nodes as possible, while keeping overhead caused by redundant messages to a minimum.

八卦可用于同构去中心化系统中的异步消息传递,其中节点可能不具有长期成员资格或以任何拓扑组织。由于八卦协议通常不需要显式协调,因此它们在具有灵活成员资格(节点频繁加入和离开)的系统或网状网络中非常有用。

Gossip can be used for asynchronous message delivery in homogeneous decentralized systems, where nodes may not have long-term membership or be organized in any topology. Since gossip protocols generally do not require explicit coordination, they can be useful in systems with flexible membership (where nodes are joining and leaving frequently) or mesh networks.

Gossip 协议非常强大,有助于在分布式系统固有的故障存在时实现高可靠性。由于消息以随机方式中继,因此即使它们之间的某些通信组件发生故障,它们仍然可以传递,只是通过不同的路径。可以说,系统能够适应失败。

Gossip protocols are very robust and help to achieve high reliability in the presence of failures inherent to distributed systems. Since messages are relayed in a randomized manner, they still can be delivered even if some communication components between them fail, just through the different paths. It can be said that the system adapts to failures.

八卦机制

Gossip Mechanics

进程定期随机选择 f 个对等点(其中 f 是一个可配置参数,称为扇出(fanout)),并与它们交换当前“热门”的信息。每当进程从其对等点获悉一条新信息时,它就会尝试进一步传递该信息。由于对等点是按概率选择的,因此总会存在一些重叠,消息会被重复传递,并可能继续流传一段时间。消息冗余度是衡量重复传递所产生开销的指标。冗余是一个重要的属性,对八卦协议的运作方式至关重要。

Processes periodically select f peers at random (where f is a configurable parameter, called fanout) and exchange currently “hot” information with them. Whenever the process learns about a new piece of information from its peers, it will attempt to pass it on further. Because peers are selected probabilistically, there will always be some overlap, and messages will get delivered repeatedly and may continue circulating for some time. Message redundancy is a metric that captures the overhead incurred by repeated delivery. Redundancy is an important property, and it is crucial to how gossip works.

系统达到收敛所需的时间称为延迟。达到收敛(停止八卦过程)和将消息传递给所有对等点之间存在细微差别,因为可能会在短时间内通知所有对等点,但八卦仍在继续。扇出和延迟取决于系统大小:在较大的系统中,我们要么必须增加扇出以保持延迟稳定,要么允许更高的延迟。

The amount of time the system requires to reach convergence is called latency. There’s a slight difference between reaching convergence (stopping the gossip process) and delivering the message to all peers, since there might be a short period during which all peers are notified, but gossip continues. Fanout and latency depend on the system size: in a larger system, we either have to increase the fanout to keep latency stable, or allow higher latency.

随着时间的推移,当节点注意到它们一次又一次地接收到相同的信息时,该消息将开始失去重要性,节点最终将不得不停止转发它。兴趣损失可以按概率计算(在每一步为每个进程计算停止传播的概率),或者使用阈值(统计收到的重复项数量,当该数量过高时停止传播)。这两种方法都必须考虑集群大小和扇出。通过统计重复项来衡量收敛性可以改善延迟并减少冗余 [DEMERS87]。

Over time, as the nodes notice they’ve been receiving the same information again and again, the message will start losing importance and nodes will have to eventually stop relaying it. Interest loss can be computed either probabilistically (the probability of propagation stop is computed for each process on every step) or using a threshold (the number of received duplicates is counted, and propagation is stopped when this number is too high). Both approaches have to take the cluster size and fanout into consideration. Counting duplicates to measure convergence can improve latency and reduce redundancy [DEMERS87].
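A toy simulation helps build intuition for fanout and convergence. This sketch implements only the push step with a hard round cap, omitting interest loss and message counting; all parameters are illustrative.

```python
import random

# Toy push-gossip simulation: each infective node relays the rumor to
# `fanout` random peers per round. Convergence (all nodes infected)
# typically takes on the order of log N rounds.

def rounds_to_converge(n, fanout, seed=1):
    rng = random.Random(seed)
    infected = {0}            # node 0 starts the rumor
    rounds = 0
    while len(infected) < n and rounds < 1000:
        rounds += 1
        newly = set()
        for _ in infected:
            # Each infective node pushes to `fanout` random peers.
            newly.update(rng.sample(range(n), fanout))
        infected |= newly
    return rounds
```

Increasing the cluster size while holding fanout fixed increases the number of rounds; increasing the fanout trades extra redundant messages for lower latency.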

在一致性方面,八卦协议提供收敛一致性[BIRMAN07]:节点有更高的概率对过去发生的事件有相同的看法。

In terms of consistency, gossip protocols offer convergent consistency [BIRMAN07]: nodes have a higher probability to have the same view of the events that occurred further in the past.

覆盖网络

Overlay Networks

尽管八卦协议重要且有用,但它们通常只应用于一小类问题。非流行病方法可以以非概率的确定性、更少的冗余,并且通常以更优的方式分发消息 [BIRMAN07]。八卦算法经常因其可扩展性以及能够在 log N 轮消息内分发消息(其中 N 是集群的大小)[KREMARREC07] 而受到赞扬,但同样重要的是要记住八卦轮次中产生的冗余消息的数量。为了实现可靠性,基于八卦的协议会产生一些重复的消息传递。

Even though gossip protocols are important and useful, they’re usually applied for a narrow set of problems. Nonepidemic approaches can distribute the message with nonprobabilistic certainty, less redundancy, and generally in a more optimal way [BIRMAN07]. Gossip algorithms are often praised for their scalability and the fact it is possible to distribute a message within log N message rounds (where N is the size of the cluster) [KREMARREC07], but it’s important to keep the number of redundant messages generated during gossip rounds in mind as well. To achieve reliability, gossip-based protocols produce some duplicate message deliveries.

随机选择节点大大提高了系统的健壮性:即使出现网络分区,只要存在间接连接两个进程的链路,消息最终仍会被传递。这种方法的明显缺点是它不是消息最优的:为了保证健壮性,我们必须维护对等点之间的冗余连接并发送冗余消息。

Selecting nodes at random greatly improves system robustness: if there is a network partition, messages will be delivered eventually if there are links that indirectly connect two processes. The obvious downside of this approach is that it is not message-optimal: to guarantee robustness, we have to maintain redundant connections between the peers and send redundant messages.

两种方法之间的中间立场是在八卦系统中构建临时固定拓扑。这可以通过创建对等点的覆盖网络来实现:节点可以对其对等点进行采样并根据邻近度(通常通过延迟来衡量)选择最佳接触点。

A middle ground between the two approaches is to construct a temporary fixed topology in a gossip system. This can be achieved by creating an overlay network of peers: nodes can sample their peers and select the best contact points based on proximity (usually measured by the latency).

系统中的节点可以形成生成树:覆盖整个网络、具有不同边的无环图。有了这样的图,消息就可以在固定数量的步骤内分发。

Nodes in the system can form spanning trees: unidirected, loop-free graphs with distinct edges, covering the whole network. Having such a graph, messages can be distributed in a fixed number of steps.

图 12-4显示了生成树的示例:1

Figure 12-4 shows an example of a spanning tree:1

  • a)我们在不使用所有边的情况下实现了点之间的完全连接。

  • a) We achieve full connectivity between the points without using all the edges.

  • b) 如果只有一个链接断开,我们可能会失去与整个子树的连接。

  • b) We can lose connectivity to the entire subtree if just a single link is broken.

图 12-4。生成树。黑点代表节点。黑线代表覆盖网络。灰线表示节点之间其他可能存在的连接。

这种方法的潜在缺点之一是,它可能会导致形成相互关联的“孤岛”,这些“孤岛”由彼此具有强烈偏好的同行组成。

One of the potential downsides of this approach is that it might lead to forming interconnected “islands” of peers having strong preferences toward each other.

为了保持较低的消息数量,同时允许在连接丢失时快速恢复,我们可以在系统处于稳定状态时混合使用两种方法(固定拓扑和基于树的广播),并回退到八卦以进行故障转移和系统恢复。

To keep the number of messages low, while allowing quick recovery in case of a connectivity loss, we can mix both approaches—fixed topologies and tree-based broadcast—when the system is in a stable state, and fall back to gossip for failover and system recovery.

混合八卦

Hybrid Gossip

推送/惰性推送多播树(Plumtree)[LEITAO07] 在流行病式广播和基于树的广播原语之间做出权衡。Plumtree 的工作原理是创建节点的生成树覆盖,以最小的开销主动分发消息。在正常情况下,节点仅向对等采样服务提供的一小部分对等点发送完整消息。

Push/lazy-push multicast trees (Plumtrees) [LEITAO07] make a trade-off between epidemic and tree-based broadcast primitives. Plumtrees work by creating a spanning tree overlay of nodes to actively distribute messages with the smallest overhead. Under normal conditions, nodes send full messages to just a small subset of peers provided by the peer sampling service.

每个节点将完整消息发送给一小部分节点,而对于其余节点,它只延迟转发消息 ID。如果节点收到一条从未见过的消息的标识符,它可以向其对等点查询以获取该消息。这个惰性推送步骤确保了高可靠性,并提供了一种快速修复广播树的方法。如果出现故障,协议会通过惰性推送步骤回退到八卦方法,广播消息并修复覆盖网络。

Each node sends the full message to the small subset of nodes, and for the rest of the nodes, it lazily forwards only the message ID. If the node receives the identifier of a message it has never seen, it can query its peers to get it. This lazy-push step ensures high reliability and provides a way to quickly heal the broadcast tree. In case of failures, protocol falls back to the gossip approach through lazy-push steps, broadcasting the message and repairing the overlay.
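The eager/lazy split can be sketched as below. The class, the single-hop broadcast, and the pull-by-ID mechanism are simplifying assumptions; real Plumtree nodes also graft and prune tree edges based on which path delivers a message first, which is omitted here.

```python
# Sketch of Plumtree-style eager/lazy dissemination: eager peers receive
# the full payload; lazy peers receive only the message ID and pull the
# payload the first time they hear of it.

class PlumtreeNode:
    def __init__(self, name):
        self.name = name
        self.seen = {}                  # message id -> payload
        self.eager, self.lazy = set(), set()

    def broadcast(self, nodes, msg_id, payload):
        self.seen[msg_id] = payload
        for p in self.eager:
            nodes[p].receive_full(msg_id, payload)
        for p in self.lazy:
            nodes[p].receive_ihave(nodes, msg_id, self.name)

    def receive_full(self, msg_id, payload):
        if msg_id not in self.seen:
            self.seen[msg_id] = payload

    def receive_ihave(self, nodes, msg_id, sender):
        # Pull the payload only if this is the first time we hear of it.
        if msg_id not in self.seen:
            self.seen[msg_id] = nodes[sender].seen[msg_id]
```

If an eager link fails, the lazy `ihave` announcements still reach the subtree, and the pull path can be promoted to repair the broadcast tree.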

由于分布式系统的性质,任何节点或节点之间的链接可能随时发生故障,从而导致当段不可达时无法遍历树。懒惰的八卦网络有助于向同级通知所看到的消息,以便构建和修复树。

Due to the nature of distributed systems, any node or link between the nodes might fail at any time, making it impossible to traverse the tree when the segment becomes unreachable. The lazy gossip network helps to notify peers about seen messages in order to construct and repair the tree.

图 12-5显示了这种双重连接的图示:节点通过最佳生成树(实线)和惰性八卦网络(虚线)连接。该图并不代表任何特定的网络拓扑,而仅代表节点之间的连接。

Figure 12-5 shows an illustration of such double connectivity: nodes are connected with an optimal spanning tree (solid lines) and the lazy gossip network (dotted lines). This illustration does not represent any particular network topology, but only connections between the nodes.

图 12-5。懒惰和热切的推送网络。实线代表广播树。虚线代表懒惰的八卦连接。

使用惰性推送机制进行树构建和修复的优点之一是,在负载恒定的网络中,它往往会生成一棵同样能最小化消息延迟的树,因为最先响应的节点会被添加到广播树中。

One of the advantages of using the lazy-push mechanism for tree construction and repair is that in a network with constant load, it will tend to generate a tree that also minimizes message latency, since nodes that are first to respond are added to the broadcast tree.

部分视图

Partial Views

向所有已知对等点广播消息并维护集群的完整视图可能变得昂贵且不切实际,尤其是在流失率(churn,衡量系统中加入和离开节点数量的指标)很高的情况下。为了避免这种情况,八卦协议通常使用对等采样服务。该服务维护集群的部分视图,并使用八卦定期刷新。部分视图之间会有重叠,因为八卦协议需要一定程度的冗余,但过多的冗余意味着我们在做额外的工作。

Broadcasting messages to all known peers and maintaining a full view of the cluster can get expensive and impractical, especially if the churn (measure of the number of joining and leaving nodes in the system) is high. To avoid this, gossip protocols often use a peer sampling service. This service maintains a partial view of the cluster, which is periodically refreshed using gossip. Partial views overlap, as some degree of redundancy is desired in gossip protocols, but too much redundancy means we’re doing extra work.

例如,混合部分视图 (HyParView) 协议[LEITAO07]维护集群的一个较小的主动视图和一个较大的被动视图。活动视图中的节点创建可用于传播的覆盖层。被动视图用于维护节点列表,这些节点可用于替换主动视图中出现故障的节点。

For example, the Hybrid Partial View (HyParView) protocol [LEITAO07] maintains a small active view and a larger passive view of the cluster. Nodes from the active view create an overlay that can be used for dissemination. Passive view is used to maintain a list of nodes that can be used to replace the failed ones from the active view.

节点定期执行洗牌操作,在此期间它们交换主动和被动视图。在此交换期间,节点将从对等方收到的被动视图和主动视图中的成员添加到其被动视图中,循环出最旧的值以限制列表大小。

Periodically, nodes perform a shuffle operation, during which they exchange their active and passive views. During this exchange, nodes add the members from both passive and active views they receive from their peers to their passive views, cycling out the oldest values to cap the list size.

主动视图根据该视图中节点的状态变化以及来自对等点的请求进行更新。如果进程 P1 怀疑其主动视图中的对等点之一 P2 已发生故障,P1 会将 P2 从其主动视图中删除,并尝试与来自被动视图的替换进程 P3 建立连接。如果连接失败,P3 将从 P1 的被动视图中删除。

The active view is updated depending on the state changes of nodes in this view and requests from peers. If a process P1 suspects that P2, one of the peers from its active view, has failed, P1 removes P2 from its active view and attempts to establish a connection with a replacement process P3 from the passive view. If the connection fails, P3 is removed from the passive view of P1.

根据 P1 主动视图中的进程数量,如果 P3 的主动视图已满,它可以选择拒绝该连接。如果 P1 的视图为空,P3 必须将其当前主动视图中的某个对等点替换为 P1。这有助于正在引导或恢复的节点快速成为集群的有效成员,代价是循环替换一些连接。

Depending on the number of processes in P1’s active view, P3 may choose to decline the connection if its active view is already full. If P1’s view is empty, P3 has to replace one of its current active view peers with P1. This helps bootstrapping or recovering nodes to quickly become effective members of the cluster at the cost of cycling some connections.
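The failure-replacement step described above can be sketched as a small function. The function name, the set/list view representation, and the `can_connect` callback are hypothetical; real HyParView exchanges connection-request messages and applies priorities, which this sketch omits.

```python
# Sketch of HyParView-style active-view repair: when an active peer is
# suspected failed, promote a candidate from the passive view; candidates
# that refuse or fail the connection are dropped from the passive view.

def replace_failed_peer(active, passive, failed, can_connect):
    active.discard(failed)
    while passive:
        candidate = passive.pop()
        if can_connect(candidate):
            active.add(candidate)
            return candidate
        # Connection failed: remove the candidate from the passive view.
    return None
```

Periodic shuffles refill the passive view, so there is usually a pool of fresh candidates to promote from.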

这种方法仅使用主动视图节点进行传播,有助于减少系统中的消息数量,同时通过使用被动视图作为恢复机制来保持高可靠性。性能和质量衡量标准之一是在拓扑重组的情况下对等采样服务收敛到稳定覆盖的速度如何[JELASITY04]。HyParView 在这里得分相当高,因为视图的维护方式以及它优先考虑引导进程。

This approach helps to reduce the number of messages in the system by using only active view nodes for dissemination, while maintaining high reliability by using passive views as a recovery mechanism. One of the performance and quality measures is how quickly a peer sampling service converges to a stable overlay in cases of topology reorganization [JELASITY04]. HyParView scores rather high here, because of how the views are maintained and since it gives priority to bootstrapping processes.

HyParView 和 Plumtree 都使用混合八卦方法:使用一小部分对等点来广播消息,并在出现故障和网络分区时回退到更广泛的对等点网络。这两个系统都不依赖于包含所有对等点的全局视图,这不仅是因为系统中可能存在大量节点(大多数情况下并非如此),还因为在每个节点上维护最新成员列表会带来开销。部分视图允许节点只与一小部分相邻节点主动通信。

HyParView and Plumtree use a hybrid gossip approach: using a small subset of peers for broadcasting messages and falling back to a wider network of peers in case of failures and network partitions. Both systems do not rely on a global view that includes all the peers, which can be helpful not only because of a large number of nodes in the system (which is not the case most of the time), but also because of costs associated with maintaining an up-to-date list of members on every node. Partial views allow nodes to actively communicate with only a small subset of neighboring nodes.

总结

Summary

最终一致系统允许副本状态分歧。可调节的一致性使我们能够用一致性来换取可用性,反之亦然。副本分歧可以使用反熵机制之一来解决:

Eventually consistent systems allow replica state divergence. Tunable consistency allows us to trade consistency for availability and vice versa. Replica divergence can be resolved using one of the anti-entropy mechanisms:

提示切换
Hinted handoff

在目标关闭时临时存储相邻节点上的写入,并在目标恢复后立即在目标上重播它们。

Temporarily store writes on neighboring nodes in case the target is down, and replay them on the target as soon as it comes back up.

读修复
Read-repair

通过比较响应、检测丢失的记录并将其发送到滞后副本,在读取期间协调请求的数据范围。

Reconcile requested data ranges during the read by comparing responses, detecting missing records, and sending them to lagging replicas.

默克尔树
Merkle trees

通过计算和交换哈希分层树来检测需要修复的数据范围。

Detect data ranges that require repair by computing and exchanging hierarchical trees of hashes.

位图版本向量
Bitmap version vectors

通过维护包含最新写入信息的紧凑记录来检测丢失的副本写入。

Detect missing replica writes by maintaining compact records containing information about the most recent writes.

这些反熵方法针对三个参数之一进行优化:范围缩小、新近度或完整性。我们可以通过仅同步正在主动查询的数据(读取修复)或单个丢失的写入(提示切换)来缩小反熵的范围。如果我们假设大多数故障都是暂时的,并且参与者会尽快从中恢复,那么我们可以存储最近分歧事件的日志,并准确地知道发生故障时要同步的内容(位图版本向量)。如果我们需要成对比较多个节点上的整个数据集并有效地定位它们之间的差异,我们可以对数据进行散列并比较散列(Merkle 树)。

These anti-entropy approaches optimize for one of the three parameters: scope reduction, recency, or completeness. We can reduce the scope of anti-entropy by only synchronizing the data that is being actively queried (read-repairs) or individual missing writes (hinted handoff). If we assume that most failures are temporary and participants recover from them as quickly as possible, we can store the log of the most recent diverged events and know exactly what to synchronize in the event of failure (bitmap version vectors). If we need to compare entire datasets on multiple nodes pairwise and efficiently locate differences between them, we can hash the data and compare hashes (Merkle trees).

为了在大型系统中可靠地分发信息,可以使用八卦协议。混合八卦协议减少了交换消息的数量,同时在可能的情况下保持对网络分区的抵抗力。

To reliably distribute information in a large-scale system, gossip protocols can be used. Hybrid gossip protocols reduce the number of exchanged messages while remaining resistant to network partitions, when possible.

许多现代系统使用八卦来进行故障检测和传播成员信息 [DECANDIA07]。HyParView 被用于高性能、高可扩展性的分布式计算框架 Partisan 中。Plumtree 曾在 Riak core 中用于传播集群范围的信息。

Many modern systems use gossip for failure detection and membership information [DECANDIA07]. HyParView is used in Partisan, the high-performance, high-scalability distributed computing framework. Plumtree was used in the Riak core for cluster-wide information.

1 This example is only used for illustration: nodes in the network are generally not arranged in a grid.

Chapter 13. Distributed Transactions

To maintain order in a distributed system, we have to guarantee at least some consistency. In “Consistency Models”, we talked about single-object, single-operation consistency models that help us to reason about the individual operations. However, in databases we often need to execute multiple operations atomically.

Atomic operations are explained in terms of state transitions: the database was in state A before a particular transaction was started; by the time it finished, the state went from A to B. In operation terms, this is simple to understand, since transactions have no predetermined attached state. Instead, they apply operations to data records starting at some point in time. This gives us some flexibility in terms of scheduling and execution: transactions can be reordered and even retried.

The main focus of transaction processing is to determine permissible histories, to model and represent possible interleaving execution scenarios. History, in this case, represents a dependency graph: which transactions have been executed prior to execution of the current transaction. History is said to be serializable if it is equivalent (i.e., has the same dependency graph) to some history that executes these transactions sequentially. You can review concepts of histories, their equivalence, serializability, and other concepts in “Serializability”. Generally, this chapter is a distributed systems counterpart of Chapter 5, where we discussed node-local transaction processing.

Single-partition transactions involve the pessimistic (lock-based or tracking) or optimistic (try and validate) concurrency control schemes that we discussed in Chapter 5, but neither one of these approaches solves the problem of multipartition transactions, which require coordination between different servers, distributed commit, and rollback protocols.

Generally speaking, when transferring money from one account to another, you’d like to both credit the first account and debit the second one simultaneously. However, if we break down the transaction into individual steps, even debiting or crediting doesn’t look atomic at first sight: we need to read the old balance, add or subtract the required amount, and save this result. Each one of these substeps involves several operations: the node receives a request, parses it, locates the data on disk, makes a write and, finally, acknowledges it. Even this is a rather high-level view: to execute a simple write, we have to perform hundreds of small steps.

This means that we have to first execute the transaction and only then make its results visible. But let’s first define what transactions are. A transaction is a set of operations, an atomic unit of execution. Transaction atomicity implies that all its results become visible or none of them do. For example, if we modify several rows, or even tables in a single transaction, either all or none of the modifications will be applied.

To ensure atomicity, transactions should be recoverable. In other words, if the transaction cannot complete, is aborted, or times out, its results have to be rolled back completely. A nonrecoverable, partially executed transaction can leave the database in an inconsistent state. In summary, in case of unsuccessful transaction execution, the database state has to be reverted to its previous state, as if this transaction was never tried in the first place.

Another important aspect is network partitions and node failures: nodes in the system fail and recover independently, but their states have to remain consistent. This means that the atomicity requirement holds not only for the local operations, but also for operations executed on other nodes: changes have to be durably propagated to all of the nodes involved in the transaction or none of them [LAMPSON79].

Making Operations Appear Atomic

To make multiple operations appear atomic, especially if some of them are remote, we need to use a class of algorithms called atomic commitment. Atomic commitment doesn’t allow disagreements between the participants: a transaction will not commit if even one of the participants votes against it. At the same time, this means that failed processes have to reach the same conclusion as the rest of the cohort. Another important implication of this fact is that atomic commitment algorithms do not work in the presence of Byzantine failures: when the process lies about its state or decides on an arbitrary value, since it contradicts unanimity [HADZILACOS05].

The problem that atomic commitment is trying to solve is reaching an agreement on whether or not to execute the proposed transaction. Cohorts cannot choose, influence, or change the proposed transaction or propose any alternative: they can only give their vote on whether or not they are willing to execute it [ROBINSON08].

Atomic commitment algorithms do not set strict requirements for the semantics of transaction prepare, commit, or rollback operations. Database implementers have to decide on:

  • When the data is considered ready to commit, and they’re just a pointer swap away from making the changes public.

  • How to perform the commit itself to make transaction results visible in the shortest timeframe possible.

  • How to roll back the changes made by the transaction if the algorithm decides not to commit.

We discussed node-local implementations of these processes in Chapter 5.

Many distributed systems use atomic commitment algorithms—for example, MySQL (for distributed transactions) and Kafka (for producer and consumer interaction [MEHTA17]).

In databases, distributed transactions are executed by the component commonly known as a transaction manager. The transaction manager is a subsystem responsible for scheduling, coordinating, executing, and tracking transactions. In a distributed environment, the transaction manager is responsible for ensuring that node-local visibility guarantees are consistent with the visibility prescribed by distributed atomic operations. In other words, transactions commit in all partitions, and for all replicas.

We will discuss two atomic commitment algorithms: two-phase commit, which solves a commitment problem, but doesn’t allow for failures of the coordinator process; and three-phase commit [SKEEN83], which solves a nonblocking atomic commitment problem,1 and allows participants to proceed even in case of coordinator failures [BABAOGLU93].

Two-Phase Commit

Let’s start with the most straightforward protocol for a distributed commit that allows multipartition atomic updates. (For more information on partitioning, you can refer to “Database Partitioning”.) Two-phase commit (2PC) is usually discussed in the context of database transactions. 2PC executes in two phases. During the first phase, the decided value is distributed, and votes are collected. During the second phase, nodes just flip the switch, making the results of the first phase visible.

2PC assumes the presence of a leader (or coordinator) that holds the state, collects votes, and is a primary point of reference for the agreement round. The rest of the nodes are called cohorts. Cohorts, in this case, are usually partitions that operate over disjoint datasets, against which transactions are performed. The coordinator and every cohort keep local operation logs for each executed step. Participants vote to accept or reject some value, proposed by the coordinator. Most often, this value is an identifier of the distributed transaction that has to be executed, but 2PC can be used in other contexts as well.

The coordinator can be a node that received a request to execute the transaction, or it can be picked at random, using a leader-election algorithm, assigned manually, or even fixed throughout the lifetime of the system. The protocol does not place restrictions on the coordinator role, and the role can be transferred to another participant for reliability or performance.

As the name suggests, a two-phase commit is executed in two steps:

Prepare

The coordinator notifies cohorts about the new transaction by sending a Propose message. Cohorts make a decision on whether or not they can commit the part of the transaction that applies to them. If a cohort decides that it can commit, it notifies the coordinator about the positive vote. Otherwise, it responds to the coordinator, asking it to abort the transaction. All decisions taken by cohorts are persisted in the coordinator log, and each cohort keeps a copy of its decision locally.

Commit/abort

Operations within a transaction can change state across different partitions (each represented by a cohort). If even one of the cohorts votes to abort the transaction, the coordinator sends the Abort message to all of them. Only if all cohorts have voted positively does the coordinator send them a final Commit message.

This process is shown in Figure 13-1.

准备阶段,协调者分发提议值,并收集参与者对是否应提交该提议值的投票。例如,如果另一个冲突事务已经提交了不同的值,群组可以选择拒绝协调者的提议。

During the prepare phase, the coordinator distributes the proposed value and collects votes from the participants on whether or not this proposed value should be committed. Cohorts may choose to reject the coordinator’s proposal if, for example, another conflicting transaction has already committed a different value.

Figure 13-1. Two-phase commit protocol. During the first phase, cohorts are notified about the new transaction. During the second phase, the transaction is committed or aborted.

After the coordinator has collected the votes, it can make a decision on whether to commit the transaction or abort it. If all cohorts have voted positively, it decides to commit and notifies them by sending a Commit message. Otherwise, the coordinator sends an Abort message to all cohorts and the transaction gets rolled back. In other words, if one node rejects the proposal, the whole round is aborted.
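The two phases can be sketched as the following simulation. It is a minimal illustration, not a production protocol: durability, timeouts, and coordinator recovery are omitted, and the `Cohort` and `two_phase_commit` names are invented for the example.

```python
from enum import Enum

class Decision(Enum):
    COMMIT = "commit"
    ABORT = "abort"

class Cohort:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.log = []  # stand-in for the cohort's durable operation log

    def prepare(self, txn):
        vote = self.can_commit  # vote on the proposed transaction
        self.log.append(("prepare", txn, vote))
        return vote

    def finish(self, txn, decision):
        self.log.append((decision.value, txn))

def two_phase_commit(coordinator_log, cohorts, txn):
    """Phase 1: distribute the proposal and collect votes.
    Phase 2: commit only on a unanimous positive vote."""
    votes = [c.prepare(txn) for c in cohorts]
    decision = Decision.COMMIT if all(votes) else Decision.ABORT
    coordinator_log.append((txn, decision))  # persist the decision first
    for c in cohorts:
        c.finish(txn, decision)
    return decision

log = []
print(two_phase_commit(log, [Cohort("a"), Cohort("b")], "t1"))         # Decision.COMMIT
print(two_phase_commit(log, [Cohort("a"), Cohort("b", False)], "t2"))  # Decision.ABORT
```

A single negative vote in the second call is enough to abort the whole round, which is exactly the unanimity requirement described above.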

During each step the coordinator and cohorts have to write the results of each operation to durable storage to be able to reconstruct the state and recover in case of local failures, and be able to forward and replay results for other participants.

In the context of database systems, each 2PC round is usually responsible for a single transaction. During the prepare phase, transaction contents (operations, identifiers, and other metadata) are transferred from the coordinator to the cohorts. The transaction is executed by the cohorts locally and is left in a partially committed state (sometimes called precommitted), making it ready for the coordinator to finalize execution during the next phase by either committing or aborting it. By the time the transaction commits, its contents are already stored durably on all other nodes [BERNSTEIN09].

Cohort Failures in 2PC

Let’s consider several failure scenarios. For example, as Figure 13-2 shows, if one of the cohorts fails during the propose phase, the coordinator cannot proceed with a commit, since it requires all votes to be positive. If one of the cohorts is unavailable, the coordinator will abort the transaction. This requirement has a negative impact on availability: failure of a single node can prevent transactions from happening. Some systems, for example, Spanner (see “Distributed Transactions with Spanner”), perform 2PC over Paxos groups rather than individual nodes to improve protocol availability.

Figure 13-2. Cohort failure during the propose phase

The main idea behind 2PC is a promise by a cohort that, once it has positively responded to the proposal, it will not go back on its decision, so only the coordinator can abort the transaction.

If one of the cohorts has failed after accepting the proposal, it has to learn about the actual outcome of the vote before it can serve values correctly, since the coordinator might have aborted the commit due to the other cohorts’ decisions. When a cohort node recovers, it has to get up to speed with a final coordinator decision. Usually, this is done by persisting the decision log on the coordinator side and replicating decision values to the failed participants. Until then, the cohort cannot serve requests because it is in an inconsistent state.

Since the protocol has multiple spots where processes are waiting for the other participants (when the coordinator collects votes, or when the cohort is waiting for the commit/abort phase), link failures might lead to message loss, and this wait will continue indefinitely. If the coordinator does not receive a response from the replica during the propose phase, it can trigger a timeout and abort the transaction.

Coordinator Failures in 2PC

If one of the cohorts does not receive a commit or abort command from the coordinator during the second phase, as shown in Figure 13-3, it should attempt to find out which decision was made by the coordinator. The coordinator might have decided upon the value but wasn’t able to communicate it to the particular replica. In such cases, information about the decision can be replicated from the peers’ transaction logs or from the backup coordinator. Replicating commit decisions is safe since it’s always unanimous: the whole point of 2PC is to either commit or abort on all sites, and commit on one cohort implies that all other cohorts have to commit.

Figure 13-3. Coordinator failure after the propose phase

During the first phase, the coordinator collects votes and, subsequently, promises from cohorts, that they will wait for its explicit commit or abort command. If the coordinator fails after collecting the votes, but before broadcasting vote results, the cohorts end up in a state of uncertainty. This is shown in Figure 13-4. Cohorts do not know what precisely the coordinator has decided, and whether or not any of the participants (potentially also unreachable) might have been notified about the transaction results [BERNSTEIN87].

Figure 13-4. Coordinator failure before contacting any cohorts

Inability of the coordinator to proceed with a commit or abort leaves the cluster in an undecided state. This means that cohorts will not be able to learn about the final decision in case of a permanent coordinator failure. Because of this property, we say that 2PC is a blocking atomic commitment algorithm. If the coordinator never recovers, its replacement has to collect votes for a given transaction again, and proceed with a final decision.

Many databases use 2PC: MySQL, PostgreSQL, MongoDB,2 and others. Two-phase commit is often used to implement distributed transactions because of its simplicity (it is easy to reason about, implement, and debug) and low overhead (message complexity and the number of round-trips of the protocol are low). It is important to implement proper recovery mechanisms and have backup coordinator nodes to reduce the chance of the failures just described.

Three-Phase Commit

To make an atomic commitment protocol robust against coordinator failures and avoid undecided states, the three-phase commit (3PC) protocol adds an extra step, and timeouts on both sides that can allow cohorts to proceed with either commit or abort in the event of coordinator failure, depending on the system state. 3PC assumes a synchronous model and that communication failures are not possible [BABAOGLU93].

3PC adds a prepare phase before the commit/abort step, which communicates cohort states collected by the coordinator during the propose phase, allowing the protocol to carry on even if the coordinator fails. All other properties of 3PC and a requirement to have a coordinator for the round are similar to its two-phase sibling. Another useful addition to 3PC is timeouts on the cohort side. Depending on which step the process is currently executing, either a commit or abort decision is forced on timeout.
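The cohort-side timeout rule can be summarized as a tiny state machine. This is an illustrative sketch that models only the decision forced on timeout, not the messaging: a cohort that never saw a Prepare message aborts, while one that reached the prepared state can safely commit, because at that point every participant is known to have voted yes.

```python
from enum import Enum

class Phase(Enum):
    PROPOSED = 1   # voted, still waiting for the Prepare message
    PREPARED = 2   # received Prepare, waiting for the final Commit
    COMMITTED = 3
    ABORTED = 4

def on_timeout(phase):
    """3PC cohort timeout rule: abort if the round never reached the
    prepared state, commit if it did; terminal states are unchanged."""
    if phase is Phase.PROPOSED:
        return Phase.ABORTED
    if phase is Phase.PREPARED:
        return Phase.COMMITTED
    return phase

print(on_timeout(Phase.PROPOSED))  # Phase.ABORTED
print(on_timeout(Phase.PREPARED))  # Phase.COMMITTED
```

The split-brain scenario discussed later arises precisely because, under a partition, some cohorts time out in `PROPOSED` while others time out in `PREPARED`.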

As Figure 13-5 shows, the three-phase commit round consists of three steps:

Propose

The coordinator sends out a proposed value and collects the votes.

Prepare

The coordinator notifies cohorts about the vote results. If the vote has passed and all cohorts have decided to commit, the coordinator sends a Prepare message, instructing them to prepare to commit. Otherwise, an Abort message is sent and the round completes.

Commit

Cohorts are notified by the coordinator to commit the transaction.

Figure 13-5. Three-phase commit

提议步骤中,与 2PC 类似,协调者分配提议值并从队列中收集投票,​​如图13-5所示。如果协调器在此阶段崩溃并且操作超时,或者如果其中一个群组投了反对票,则交易将被中止。

During the propose step, similar to 2PC, the coordinator distributes the proposed value and collects votes from cohorts, as shown in Figure 13-5. If the coordinator crashes during this phase and the operation times out, or if one of the cohorts votes negatively, the transaction will be aborted.

After collecting the votes, the coordinator makes a decision. If the coordinator decides to proceed with a transaction, it issues a Prepare command. It may happen that the coordinator cannot distribute prepare messages to all cohorts or it fails to receive their acknowledgments. In this case, cohorts may abort the transaction after timeout, since the algorithm hasn’t moved all the way to the prepared state.

As soon as all the cohorts successfully move into the prepared state and the coordinator has received their prepare acknowledgments, the transaction will be committed even if either side fails. This can be done since all participants at this stage have the same view of the state.

提交期间,协调器将准备阶段的结果传达给所有参与者,重置超时计数器并有效地完成事务

During commit, the coordinator communicates the results of the prepare phase to all the participants, resetting their timeout counters and effectively finishing the transaction.

Coordinator Failures in 3PC

All state transitions are coordinated, and cohorts can’t move on to the next phase until everyone is done with the previous one: the coordinator has to wait for the replicas to continue. Cohorts can eventually abort the transaction if they do not hear from the coordinator before the timeout, provided they haven’t moved past the prepare phase.

As we discussed previously, 2PC cannot recover from coordinator failures, and cohorts may get stuck in a nondeterministic state until the coordinator comes back. 3PC avoids blocking the processes in this case and allows cohorts to proceed with a deterministic decision.

The worst-case scenario for the 3PC is a network partition, shown in Figure 13-6. Some nodes successfully move to the prepared state, and now can proceed with commit after the timeout. Some can’t communicate with the coordinator, and will abort after the timeout. This results in a split brain: some nodes proceed with a commit and some abort, all according to the protocol, leaving participants in an inconsistent and contradictory state.

Figure 13-6. Coordinator failure during the second phase

While in theory 3PC does, to a degree, solve the problem with 2PC blocking, it has a larger message overhead, introduces potential contradictions, and does not work well in the presence of network partitions. This might be the primary reason 3PC is not widely used in practice.

Distributed Transactions with Calvin

We’ve already touched on the subject of synchronization costs and several ways around it. But there are other ways to reduce contention and the total amount of time during which transactions hold locks. One of the ways to do this is to let replicas agree on the execution order and transaction boundaries before acquiring locks and proceeding with execution. If we can achieve this, node failures do not cause transaction aborts, since nodes can recover state from other participants that execute the same transaction in parallel.

Traditional database systems execute transactions using two-phase locking or optimistic concurrency control and have no deterministic transaction order. This means that nodes have to be coordinated to preserve order. Deterministic transaction order removes coordination overhead during the execution phase and, since all replicas get the same inputs, they also produce equivalent outputs. This approach is commonly known as Calvin, a fast distributed transaction protocol [THOMSON12]. One of the prominent examples implementing distributed transactions using Calvin is FaunaDB.

To achieve deterministic order, Calvin uses a sequencer: an entry point for all transactions. The sequencer determines the order in which transactions are executed, and establishes a global transaction input sequence. To minimize contention and batch decisions, the timeline is split into epochs. The sequencer collects transactions and groups them into short time windows (the original paper mentions 10-millisecond batches), which also become replication units, so transactions do not have to be communicated separately.
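A sequencer's epoch batching can be sketched roughly as follows. This is a toy model (the `Sequencer` class and its API are invented for the example): in a real deployment each sealed batch would be replicated, e.g. via Paxos, before being handed to the scheduler.

```python
class Sequencer:
    """Toy Calvin-style sequencer: transactions submitted within one
    epoch are grouped into a single ordered batch, which becomes the
    unit of replication and the scheduler's deterministic input."""
    def __init__(self):
        self.pending = []
        self.epoch = 0

    def submit(self, txn):
        self.pending.append(txn)  # arrival order within the epoch is fixed

    def close_epoch(self):
        """Seal the current epoch (the paper mentions ~10 ms windows);
        the returned batch would then be replicated before execution."""
        batch = (self.epoch, list(self.pending))
        self.pending.clear()
        self.epoch += 1
        return batch

seq = Sequencer()
seq.submit("credit(a, 10)")
seq.submit("debit(b, 10)")
print(seq.close_epoch())  # (0, ['credit(a, 10)', 'debit(b, 10)'])
```

Because every replica receives the same numbered batches in the same order, all replicas deterministically produce the same state without further coordination.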

As soon as a transaction batch is successfully replicated, the sequencer forwards it to the scheduler, which orchestrates transaction execution. The scheduler uses a deterministic scheduling protocol that executes parts of the transaction in parallel, while preserving the serial execution order specified by the sequencer. Since applying a transaction to a specific state is guaranteed to produce only the changes specified by the transaction, and the transaction order is predetermined, replicas do not have to further communicate with the sequencer.

Each transaction in Calvin has a read set (its dependencies, which is a collection of data records from the current database state required to execute it) and a write set (results of the transaction execution; in other words, its side effects). Calvin does not natively support transactions that rely on additional reads that would determine read and write sets.

A worker thread, managed by the scheduler, proceeds with execution in four steps:

  1. It analyzes the transaction’s read and write sets, determines node-local data records from the read set, and creates the list of active participants (i.e., ones that hold the elements of the write set, and will perform modifications on the data).

  2. It collects the local data required to execute the transaction, in other words, the read set records that happen to reside on that node. The collected data records are forwarded to the corresponding active participants.

  3. If this worker thread is executing on an active participant node, it receives data records forwarded from the other participants, as a counterpart of the operations executed during step 2.

  4. Finally, it executes a batch of transactions, persisting results into local storage. It does not have to forward execution results to the other nodes, as they receive the same inputs for transactions and execute and persist results locally themselves.

A typical Calvin implementation colocates sequencer, scheduler, worker, and storage subsystems, as Figure 13-7 shows. To make sure that sequencers reach consensus on exactly which transactions make it into the current epoch/batch, Calvin uses the Paxos consensus algorithm (see “Paxos”) or asynchronous replication, in which a dedicated replica serves as a leader. While using a leader can improve latency, it comes with a higher cost of recovery as nodes have to reproduce the state of the failed leader in order to proceed.

Figure 13-7. Calvin architecture

Distributed Transactions with Spanner

Calvin is often contrasted with another approach for distributed transaction management called Spanner [CORBETT12]. Its implementations (or derivatives) include several open source databases, most prominently CockroachDB and YugaByte DB. While Calvin establishes the global transaction execution order by reaching consensus on sequencers, Spanner uses two-phase commit over consensus groups per partition (in other words, per shard). Spanner has a rather complex setup, and we only cover high-level details in the scope of this book.

To achieve consistency and impose transaction order, Spanner uses TrueTime: a high-precision wall-clock API that also exposes an uncertainty bound, allowing local operations to introduce artificial slowdowns to wait for the uncertainty bound to pass.

Spanner offers three main operation types: read-write transactions, read-only transactions, and snapshot reads. Read-write transactions require locks, pessimistic concurrency control, and presence of the leader replica. Read-only transactions are lock-free and can be executed at any replica. A leader is required only for reads at the latest timestamp, which takes the latest committed value from the Paxos group. Reads at the specific timestamp are consistent, since values are versioned and snapshot contents can’t be changed once written. Each data record has a timestamp assigned, which holds a value of the transaction commit time. This also implies that multiple timestamped versions of the record can be stored.
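The timestamped versioning behind snapshot reads can be sketched briefly; this is an illustrative model only, not Spanner's actual storage format. A snapshot read at timestamp `t` returns the latest version committed at or before `t`, so the result at a fixed timestamp never changes once written.

```python
import bisect

class VersionedRecord:
    """Keep multiple timestamped versions of one data record."""
    def __init__(self):
        self.ts = []    # commit timestamps, kept sorted
        self.vals = []  # values, parallel to self.ts

    def write(self, commit_ts, value):
        i = bisect.bisect_right(self.ts, commit_ts)
        self.ts.insert(i, commit_ts)
        self.vals.insert(i, value)

    def snapshot_read(self, t):
        """Latest version with commit timestamp <= t, or None."""
        i = bisect.bisect_right(self.ts, t)
        return self.vals[i - 1] if i else None

rec = VersionedRecord()
rec.write(10, "v1")
rec.write(20, "v2")
print(rec.snapshot_read(15))  # v1
print(rec.snapshot_read(25))  # v2
```

Reading at timestamp 15 returns `v1` forever, regardless of later writes, which is why such reads need no locks and no leader.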

Figure 13-8 shows the Spanner architecture. Each spanserver (replica, a server instance that serves data to clients) holds several tablets, with Paxos (see “Paxos”) state machines attached to them. Replicas are grouped into replica sets called Paxos groups, a unit of data placement and replication. Each Paxos group has a long-lived leader (see “Multi-Paxos”). Leaders communicate with each other during multishard transactions.

Figure 13-8. Spanner architecture

Every write has to go through the Paxos group leader, while reads can be served directly from the tablet on up-to-date replicas. The leader holds a lock table that is used to implement concurrency control using the two-phase locking (see “Lock-Based Concurrency Control”) mechanism and a transaction manager that is responsible for multishard distributed transactions. Operations that require synchronization (such as writes and reads within a transaction) have to acquire the locks from the lock table, while other operations (snapshot reads) can access the data directly.

For multishard transactions, group leaders have to coordinate and perform a two-phase commit to ensure consistency, and use two-phase locking to ensure isolation. Since the 2PC algorithm requires the presence of all participants for a successful commit, it hurts availability. Spanner solves this by using Paxos groups rather than individual nodes as cohorts. This means that 2PC can continue operating even if some of the members of the group are down. Within the Paxos group, 2PC contacts only the node that serves as a leader.

Paxos groups are used to consistently replicate transaction manager states across multiple nodes. The Paxos leader first acquires write locks, and chooses a write timestamp that is guaranteed to be larger than any previous transactions’ timestamp, and records a 2PC prepare entry through Paxos. The transaction coordinator collects timestamps and generates a commit timestamp that is greater than any of the prepare timestamps, and logs a commit entry through Paxos. It then waits until after the timestamp it has chosen for commit, since it has to guarantee that clients will only see transaction results whose timestamps are in the past. After that, it sends this timestamp to the client and leaders, which log the commit record with the new timestamp in their local Paxos group and are now free to release the locks.

Single-shard transactions do not have to consult the transaction manager (and, subsequently, do not have to perform a cross-partition two-phase commit), since consulting a Paxos group and a lock table is enough to guarantee transaction order and consistency within the shard.

Spanner read-write transactions offer a serialization order called external consistency: transaction timestamps reflect serialization order, even in cases of distributed transactions. External consistency has real-time properties equivalent to linearizability: if transaction T1 commits before T2 starts, T1’s timestamp is smaller than the timestamp of T2.

To summarize, Spanner uses Paxos for consistent transaction log replication, two-phase commit for cross-shard transactions, and TrueTime for deterministic transaction ordering. This means that multipartition transactions have a higher cost due to an additional two-phase commit round, compared to Calvin [ABADI17]. Both approaches are important to understand since they allow us to perform transactions in partitioned distributed data stores.

Database Partitioning

While discussing Spanner and Calvin, we’ve been using the term partitioning quite heavily. Let’s now discuss it in more detail. Since storing all database records on a single node is rather unrealistic for the majority of modern applications, many databases use partitioning: a logical division of data into smaller manageable segments.

The most straightforward way to partition data is by splitting it into ranges and allowing replica sets to manage only specific ranges (partitions). When executing queries, clients (or query coordinators) have to route requests based on the routing key to the correct replica set for both reads and writes. This partitioning scheme is typically called sharding: every replica set acts as a single source for a subset of data.

To use partitions most effectively, they have to be sized, taking the load and value distribution into consideration. This means that frequently accessed, read/write heavy ranges can be split into smaller partitions to spread the load between them. At the same time, if some value ranges are more dense than other ones, it might be a good idea to split them into smaller partitions as well. For example, if we pick zip code as a routing key, since the country population is unevenly spread, some zip code ranges can have more data (e.g., people and orders) assigned to them.

When nodes are added to or removed from the cluster, the database has to re-partition the data to maintain the balance. To ensure consistent movements, we should relocate the data before we update the cluster metadata and start routing requests to the new targets. Some databases perform auto-sharding and relocate the data using placement algorithms that determine optimal partitioning. These algorithms use information about read and write loads and the amount of data in each shard.

To find a target node from the routing key, some database systems compute a hash of the key, and use some form of mapping from the hash value to the node ID. One of the advantages of using the hash functions for determining replica placement is that it can help to reduce range hot-spotting, since hash values do not sort the same way as the original values. While two lexicographically close routing keys would be placed at the same replica set, using hashed values would place them on different ones.

The most straightforward way to map hash values to node IDs is by taking a remainder of the division of the hash value by the size of the cluster (modulo). If we have N nodes in the system, the target node ID is picked by computing hash(v) modulo N. The main problem with this approach is that whenever nodes are added or removed and the cluster size changes from N to N’, many values returned by hash(v) modulo N’ will differ from the original ones. This means that most of the data will have to be moved.
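A quick sketch shows how much data modulo placement reshuffles when the cluster grows; the key names and node counts below are arbitrary:

```python
import hashlib

def node_for(key, n_nodes):
    """Place a key by hashing it and taking the remainder modulo n_nodes."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % n_nodes

keys = [f"user:{i}" for i in range(10_000)]
before = {k: node_for(k, 5) for k in keys}   # 5-node cluster
after = {k: node_for(k, 6) for k in keys}    # one node added
frac_moved = sum(before[k] != after[k] for k in keys) / len(keys)

# Roughly 5/6 of the keys change owners, instead of the ideal 1/6.
assert frac_moved > 0.7
```

A key stays put only when `h % 5 == h % 6`, which happens for about one key in six; everything else has to be relocated.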

Consistent Hashing

In order to mitigate this problem, some databases, such as Apache Cassandra and Riak (among others), use a different partitioning scheme called consistent hashing. As previously mentioned, routing key values are hashed. Values returned by the hash function are mapped to a ring, so that after the largest possible value, it wraps around to its smallest value. Each node gets its own position on the ring and becomes responsible for the range of values, between its predecessor’s and its own positions.

Using consistent hashing helps to reduce the number of relocations required for maintaining balance: a change in the ring affects only the immediate neighbors of the leaving or joining node, and not an entire cluster. The word consistent in the definition implies that, when the hash table is resized, if we have K possible hash keys and n nodes, on average we have to relocate only K/n keys. In other words, a consistent hash function output changes minimally as the function range changes [KARGER97].
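A minimal hash ring might look as follows. This is a toy sketch without virtual nodes, which systems like Cassandra add to even out range sizes:

```python
import bisect
import hashlib

def point(s):
    """Map a string to a position on the ring."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes):
        self.ring = sorted((point(n), n) for n in nodes)

    def node_for(self, key):
        """The key's owner is the first node clockwise from its hash."""
        positions = [p for p, _ in self.ring]
        i = bisect.bisect_right(positions, point(key)) % len(self.ring)
        return self.ring[i][1]

    def add(self, node):
        bisect.insort(self.ring, (point(node), node))

keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing([f"node-{i}" for i in range(5)])
before = {k: ring.node_for(k) for k in keys}
ring.add("node-5")

# Only keys in the arc taken over by node-5 relocate; every other key
# keeps its previous owner.
moved = [k for k in keys if ring.node_for(k) != before[k]]
assert all(ring.node_for(k) == "node-5" for k in moved)
assert len(moved) < len(keys)
```

Contrast this with the modulo scheme, where adding a node reshuffles most of the keys.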

Distributed Transactions with Percolator

Coming back to the subject of distributed transactions, isolation levels might be difficult to reason about because of the allowed read and write anomalies. If serializability is not required by the application, one of the ways to avoid the write anomalies described in SQL-92 is to use a transactional model called snapshot isolation (SI).

Snapshot isolation guarantees that all reads made within the transaction are consistent with a snapshot of the database. The snapshot contains all values that were committed before the transaction’s start timestamp. If there’s a write-write conflict (i.e., when two concurrently running transactions attempt to make a write to the same cell), only one of them will commit. This characteristic is usually referred to as first committer wins.

Snapshot isolation prevents read skew, an anomaly permitted under the read-committed isolation level. For example, a sum of x and y is supposed to be 100. Transaction T1 performs an operation read(x), and reads the value 70. T2 updates two values write(x, 50) and write(y, 50), and commits. If T1 attempts to run read(y), and proceeds with transaction execution based on the value of y (50), newly committed by T2, it will lead to an inconsistency. The value of x that T1 has read before T2 committed and the new value of y aren’t consistent with each other. Since snapshot isolation only makes values up to a specific timestamp visible for transactions, the new value of y, 50, won’t be visible to T1 [BERENSON95].
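This scenario can be replayed with a toy multiversioned store; the `read` function below is a sketch, not a real MVCC implementation:

```python
# Toy multiversioned store: cell -> list of (commit_ts, value), append-only.
store = {"x": [(0, 70)], "y": [(0, 30)]}

def read(cell, snapshot_ts):
    """Return the latest value committed at or before snapshot_ts."""
    return [v for ts, v in store[cell] if ts <= snapshot_ts][-1]

t1_start = 1                 # T1's start timestamp fixes its snapshot
x = read("x", t1_start)      # T1 reads x = 70

# T2 commits write(x, 50) and write(y, 50) at ts = 2.
store["x"].append((2, 50))
store["y"].append((2, 50))

# T1 keeps reading from its snapshot: it sees the old y, not T2's 50,
# so the invariant x + y == 100 holds and read skew is avoided.
y = read("y", t1_start)
assert (x, y) == (70, 30) and x + y == 100
```

Under read committed, the second read would have returned the newly committed `50`, breaking the invariant from T1's point of view.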

Snapshot isolation has several convenient properties:

  • It allows only repeatable reads of committed data.

  • Values are consistent, as they’re read from the snapshot at a specific timestamp.

  • Conflicting writes are aborted and retried to prevent inconsistencies.

Despite that, histories under snapshot isolation are not serializable. Since only conflicting writes to the same cells are aborted, we can still end up with a write skew (see “Read and Write Anomalies”). Write skew occurs when two transactions modify disjoint sets of values, each preserving invariants for the data it writes. Both transactions are allowed to commit, but a combination of writes performed by these transactions may violate these invariants.
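A classic illustration of write skew is an on-call rota with the invariant that at least one person stays on call. The scenario below is a hypothetical sketch of two such transactions running under snapshot isolation:

```python
# Invariant: at least one doctor must remain on call.
on_call = {"alice": True, "bob": True}

# Both transactions start concurrently and read from the same snapshot.
snap_t1 = dict(on_call)
snap_t2 = dict(on_call)

if snap_t1["bob"]:             # T1: "Bob is covering, I can leave"
    on_call["alice"] = False   # T1 writes only the "alice" cell
if snap_t2["alice"]:           # T2: "Alice is covering, I can leave"
    on_call["bob"] = False     # T2 writes only the "bob" cell

# The write sets are disjoint, so first-committer-wins aborts neither
# transaction, yet together they break the invariant.
assert not any(on_call.values())
```

Serializable execution would have run one transaction after the other, so the second would have seen the first one's write and kept its author on call.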

Snapshot isolation provides semantics that can be useful for many applications and has the major advantage of efficient reads, because no locks have to be acquired since snapshot data cannot be changed.

Percolator is a library that implements a transactional API on top of the distributed database Bigtable (see “Wide Column Stores”). This is a great example of building a transaction API on top of the existing system. Percolator stores data records, committed data point locations (write metadata), and locks in different columns. To avoid race conditions and reliably lock tables in a single RPC call, it uses a conditional mutation Bigtable API that allows it to perform read-modify-write operations with a single remote call.

Each transaction has to consult the timestamp oracle (a source of clusterwide-consistent monotonically increasing timestamps) twice: for a transaction start timestamp, and during commit. Writes are buffered and committed using a client-driven two-phase commit (see “Two-Phase Commit”).
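A toy timestamp oracle can be sketched as a strictly increasing counter; real implementations such as Percolator's oracle batch timestamp allocations for throughput:

```python
import itertools

class TimestampOracle:
    """Toy clusterwide source of strictly increasing timestamps."""
    def __init__(self):
        self._ticker = itertools.count(1)

    def next(self):
        return next(self._ticker)

oracle = TimestampOracle()
start_ts = oracle.next()    # first consultation: transaction start
# ... writes are buffered client-side here ...
commit_ts = oracle.next()   # second consultation: commit
assert commit_ts > start_ts
```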

Figure 13-9 shows how the contents of the table change during execution of the transaction steps:

  • a) Initial state. After the execution of the previous transaction, TS1 is the latest timestamp for both accounts. No locks are held.

  • b) The first phase, called prewrite. The transaction attempts to acquire locks for all cells written during the transaction. One of the locks is marked as primary and is used for client recovery. The transaction checks for the possible conflicts: if any other transaction has already written any data with a later timestamp or there are unreleased locks at any timestamp. If any conflict is detected, the transaction aborts.

  • c) If all locks were successfully acquired and the possibility of conflict is ruled out, the transaction can continue. During the second phase, the client releases its locks, starting with the primary one. It publishes its write by replacing the lock with a write record, updating write metadata with the timestamp of the latest data point.

Since the client may fail while trying to commit the transaction, we need to make sure that partial transactions are finalized or rolled back. If a later transaction encounters an incomplete state, it should attempt to release the primary lock and commit the transaction. If the primary lock is already released, transaction contents have to be committed. Only one transaction can hold a lock at a time and all state transitions are atomic, so situations in which two transactions attempt to perform operations on the contents are not possible.
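The prewrite/commit flow can be condensed into a single-process sketch. The column layout and conflict checks mirror the description above; rollback of partially acquired locks and primary-lock recovery are omitted for brevity:

```python
# Each cell has three columns: versioned data, a lock, and write metadata
# that points a commit timestamp at the data version it published.
cells = {}

def cell(name):
    return cells.setdefault(name, {"data": {}, "lock": None, "write": {}})

def prewrite(writes, start_ts):
    """Phase one: stage the writes and lock every written cell."""
    for name, value in writes.items():
        c = cell(name)
        # Conflict: a commit at or after our start timestamp, or an
        # unreleased lock at any timestamp.
        if c["lock"] is not None or any(ts >= start_ts for ts in c["write"]):
            return False
        c["data"][start_ts] = value
        c["lock"] = start_ts
    return True

def commit(writes, start_ts, commit_ts):
    """Phase two: publish the writes and release the locks."""
    for name in writes:
        c = cell(name)
        c["write"][commit_ts] = start_ts   # point readers at the version
        c["lock"] = None

writes = {"Account1": 250, "Account2": 50}
assert prewrite(writes, start_ts=5)
commit(writes, start_ts=5, commit_ts=6)
assert cells["Account1"]["lock"] is None
assert cells["Account1"]["data"][cells["Account1"]["write"][6]] == 250
```

A reader that finds a leftover lock would follow the primary lock to decide whether to roll the transaction forward or back, as described above.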

Figure 13-9. Percolator transaction execution steps. The transaction credits $150 from Account2 and debits it to Account1

Snapshot isolation is an important and useful abstraction, commonly used in transaction processing. Since it simplifies semantics, precludes some of the anomalies, and opens up an opportunity to improve concurrency and performance, many MVCC systems offer this isolation level.

One of the examples of databases based on the Percolator model is TiDB (“Ti” stands for Titanium). TiDB is a strongly consistent, highly available, and horizontally scalable open source database, compatible with MySQL.

Coordination Avoidance

One more example, discussing costs of serializability and attempting to reduce the amount of coordination while still providing strong consistency guarantees, is coordination avoidance [BAILIS14b]. Coordination can be avoided, while preserving data integrity constraints, if operations are invariant confluent. Invariant Confluence (I-Confluence) is defined as a property that ensures that two invariant-valid but diverged database states can be merged into a single valid, final state. Invariants in this case preserve consistency in ACID terms.

Because any two valid states can be merged into a valid state, I-Confluent operations can be executed without additional coordination, which significantly improves performance characteristics and scalability potential.

To preserve this invariant, in addition to defining an operation that brings our database to the new state, we have to define a merge function that accepts two states. This function is used in case states were updated independently and bring diverged states back to convergence.
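A grow-only set is a simple example of an I-confluent operation: set union merges any two diverged replica states into a single valid state without coordination. A minimal sketch:

```python
# Adding elements to a set is I-confluent: union merges any two valid,
# diverged replica states into one valid state.
replica_a = {"apple", "pear"}   # state after local writes on node A
replica_b = {"apple", "plum"}   # state after local writes on node B

def merge(s1, s2):
    return s1 | s2              # commutative, associative, idempotent

merged = merge(replica_a, replica_b)
assert merged == {"apple", "pear", "plum"}
assert merge(replica_a, replica_b) == merge(replica_b, replica_a)
assert merge(merged, replica_a) == merged
```

An operation like "remove the last remaining element" would not be I-confluent: two replicas could each remove a different element, and no merge function could reconcile the results while preserving a non-emptiness invariant.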

Transactions are executed against the local database versions (snapshots). If a transaction requires any state from other partitions for execution, this state is made available for it locally. If a transaction commits, resulting changes made to the local snapshot are migrated and merged with the snapshots on the other nodes. A system model that allows coordination avoidance has to guarantee the following properties:

Global validity

Required invariants are always satisfied, for both merged and divergent committed database states, and transactions cannot observe invalid states.

Availability

If all nodes holding states are reachable by the client, the transaction has to reach a commit decision, or abort, if committing it would violate one of the transaction invariants.

Convergence

Nodes can maintain their local states independently, but in the absence of further transactions and indefinite network partitions, they have to be able to reach the same state.

Coordination freedom

Local transaction execution is independent from the operations against the local states performed on behalf of the other nodes.

One of the examples of implementing coordination avoidance is Read-Atomic Multi Partition (RAMP) transactions [BAILIS14c]. RAMP uses multiversion concurrency control and metadata of current in-flight operations to fetch any missing state updates from other nodes, allowing read and write operations to be executed concurrently. For example, readers that overlap with some writer modifying the same entry can be detected and, if necessary, repaired by retrieving required information from the in-flight write metadata in an additional round of communication.

Using lock-based approaches in a distributed environment might not be the best idea; instead of doing that, RAMP provides two properties:

Synchronization independence

One client’s transactions won’t stall, abort, or force the other client’s transactions to wait.

Partition independence

Clients do not have to contact partitions whose values aren’t involved in their transactions.

RAMP introduces the read atomic isolation level: transactions cannot observe any in-process state changes from in-flight, uncommitted, and aborted transactions. In other words, all (or none) transaction updates are visible to concurrent transactions. By that definition, the read atomic isolation level also precludes fractured reads: when a transaction observes only a subset of writes executed by some other transaction.

RAMP offers atomic write visibility without requiring mutual exclusion, which other solutions, such as distributed locks, often couple together. This means that transactions can proceed without stalling each other.

RAMP distributes transaction metadata that allows reads to detect concurrent in-flight writes. By using this metadata, transactions can detect the presence of newer record versions, find and fetch the latest ones, and operate on them. To avoid coordination, all local commit decisions must also be valid globally. In RAMP, this is solved by requiring that, by the time a write becomes visible in one partition, writes from the same transaction in all other involved partitions are also visible for readers in those partitions.

To allow readers and writers to proceed without blocking other concurrent readers and writers, while maintaining the read atomic isolation level both locally and system-wide (in all other partitions modified by the committing transaction), writes in RAMP are installed and made visible using two-phase commit:

Prepare

The first phase prepares and places writes to their respective target partitions without making them visible.

Commit/abort

The second phase publishes the state changes made by the write operation of the committing transaction, making them available atomically across all partitions, or rolls back the changes.

RAMP allows multiple versions of the same record to be present at any given moment: latest value, in-flight uncommitted changes, and stale versions, overwritten by later transactions. Stale versions have to be kept around only for in-progress read requests. As soon as all concurrent readers complete, stale values can be discarded.

Making distributed transactions performant and scalable is difficult because of the coordination overhead associated with preventing, detecting, and avoiding conflicts for the concurrent operations. The larger the system, or the more transactions it attempts to serve, the more overhead it incurs. The approaches described in this section attempt to reduce the amount of coordination by using invariants to determine where coordination can be avoided, and only paying the full price if it’s absolutely necessary.

Summary

In this chapter, we discussed several ways of implementing distributed transactions. First, we discussed two atomic commitment algorithms: two- and three-phase commits. The big advantage of these algorithms is that they’re easy to understand and implement, but have several shortcomings. In 2PC, a coordinator (or at least its substitute) has to be alive for the length of the commitment process, which significantly reduces availability. 3PC lifts this requirement for some cases, but is prone to split brain in case of network partition.

Distributed transactions in modern database systems are often implemented using consensus algorithms, which we’re going to discuss in the next chapter. For example, both Calvin and Spanner, discussed in this chapter, use Paxos.

Consensus algorithms are more involved than atomic commit ones, but have much better fault-tolerance properties, and decouple decisions from their initiators and allow participants to decide on a value rather than on whether or not to accept the value [GRAY04].

1 The fine print says “assuming a highly reliable network.” In other words, a network that precludes partitions [ALHOUMAILY10]. Implications of this assumption are discussed in the paper’s section about algorithm description.

2 However, the documentation says that as of v3.6, 2PC provides only transaction-like semantics: https://databass.dev/links/7.

Chapter 14. Consensus

We’ve discussed quite a few concepts in distributed systems, starting with basics, such as links and processes, problems with distributed computing; then going through failure models, failure detectors, and leader election; discussed consistency models; and we’re finally ready to put it all together for a pinnacle of distributed systems research: distributed consensus.

Consensus algorithms in distributed systems allow multiple processes to reach an agreement on a value. FLP impossibility (see “FLP Impossibility”) shows that it is impossible to guarantee consensus in a completely asynchronous system in a bounded time. Even if message delivery is guaranteed, it is impossible for one process to know whether the other one has crashed or is running slowly.

第 9 章中,我们讨论了故障检测精度和故障检测速度之间的权衡。共识算法采用异步模型并保证安全性,而外部故障检测器可以提供有关其他进程的信息,保证活跃性[CHANDRA96]。由于故障检测并不总是完全准确,因此会出现共识算法等待检测到进程故障的情况,或者由于错误地怀疑某个进程有故障而重新启动算法的情况。

In Chapter 9, we discussed that there’s a trade-off between failure-detection accuracy and how quickly the failure can be detected. Consensus algorithms assume an asynchronous model and guarantee safety, while an external failure detector can provide information about other processes, guaranteeing liveness [CHANDRA96]. Since failure detection is not always fully accurate, there will be situations when a consensus algorithm waits for a process failure to be detected, or when the algorithm is restarted because some process is incorrectly suspected to be faulty.

Processes have to agree on some value proposed by one of the participants, even if some of them happen to crash. A process is said to be correct if it hasn’t crashed and continues executing algorithm steps. Consensus is extremely useful for putting events in a particular order, and ensuring consistency among the participants. Using consensus, we can have a system where processes move from one value to the next one without losing certainty about which values the clients observe.

From a theoretical perspective, consensus algorithms have three properties:

Agreement

The decision value is the same for all correct processes.

Validity

The decided value was proposed by one of the processes.

Termination

All correct processes eventually reach the decision.

Each one of these properties is extremely important. The agreement is embedded in the human understanding of consensus. The dictionary definition of consensus has the word “unanimity” in it. This means that upon the agreement, no process is allowed to have a different opinion about the outcome. Think of it as an agreement to meet at a particular time and place with your friends: all of you would like to meet, and only the specifics of the event are being agreed upon.

Validity is essential, because without it consensus can be trivial. Consensus algorithms require all processes to agree on some value. If processes use some predetermined, arbitrary default value as a decision output regardless of the proposed values, they will reach unanimity, but the output of such an algorithm will not be valid and it wouldn’t be useful in reality.

Without termination, our algorithm will continue forever without reaching any conclusion or will wait indefinitely for a crashed process to come back, which is not very useful, either. Processes have to agree eventually and, for a consensus algorithm to be practical, this has to happen rather quickly.

Broadcast

A broadcast is a communication abstraction often used in distributed systems. Broadcast algorithms are used to disseminate information among a set of processes. There exist many broadcast algorithms, making different assumptions and providing different guarantees. Broadcast is an important primitive and is used in many places, including consensus algorithms. We’ve discussed one of the forms of broadcast—gossip dissemination—already (see “Gossip Dissemination”).

Broadcasts are often used for database replication when a single coordinator node has to distribute the data to all other participants. However, making this process reliable is not a trivial matter: if the coordinator crashes after distributing the message to some nodes but not the other ones, it leaves the system in an inconsistent state: some of the nodes observe a new message and some do not.

The simplest and the most straightforward way to broadcast messages is through a best effort broadcast [CACHIN11]. In this case, the sender is responsible for ensuring message delivery to all the targets. If it fails, the other participants do not try to rebroadcast the message, and in the case of coordinator crash, this type of broadcast will fail silently.

For a broadcast to be reliable, it needs to guarantee that all correct processes receive the same messages, even if the sender crashes during transmission.

To implement a naive version of a reliable broadcast, we can use a failure detector and a fallback mechanism. The most straightforward fallback mechanism is to allow every process that received the message to forward it to every other process it’s aware of. When the source process fails, other processes detect the failure and continue broadcasting the message, effectively flooding the network with N2 messages (as shown in Figure 14-1). Even if the sender has crashed, messages still are picked up and delivered by the rest of the system, improving its reliability, and allowing all receivers to see the same messages [CACHIN11].
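The flooding fallback can be sketched in a few lines, and the N² message count falls out of the simulation; process names below are arbitrary:

```python
def flood_broadcast(processes, origin, crashed=frozenset()):
    """Every process re-forwards the message to all peers on first delivery."""
    delivered = {origin}           # the origin already has the message
    sent = 0
    queue = [(origin, p) for p in processes if p != origin]
    while queue:
        sender, target = queue.pop()
        sent += 1
        if target in crashed or target in delivered:
            continue               # crashed or duplicate: no re-broadcast
        delivered.add(target)
        queue.extend((target, p) for p in processes if p != target)
    return delivered, sent

procs = [f"p{i}" for i in range(5)]
delivered, sent = flood_broadcast(procs, "p0")
assert delivered == set(procs)
assert sent == len(procs) * (len(procs) - 1)   # N * (N - 1) messages

# A crashed receiver never gets or forwards the message, but everyone
# else still does, thanks to the redundant re-broadcasts.
alive, _ = flood_broadcast(procs, "p0", crashed={"p2"})
assert alive == set(procs) - {"p2"}
```

The redundancy is exactly what makes the broadcast reliable: as long as one correct process receives the message, its re-broadcast reaches all remaining correct processes.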

Figure 14-1. Broadcast

One of the downsides of this approach is the fact that it uses N2 messages, where N is the number of remaining recipients (since every broadcasting process excludes the original process and itself). Ideally, we’d want to reduce the number of messages required for a reliable broadcast.
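The flooding scheme described above can be condensed into a few lines of Python. This is a toy, fully synchronous illustration with made-up names (`Process`, `broadcast`), not a fault-tolerant implementation: each process forwards a message to every peer the first time it sees it, so delivery survives a sender crash mid-broadcast at the cost of O(N2) messages.

```python
# Toy sketch of flooding-based reliable broadcast (illustrative names,
# synchronous delivery): every process that receives a message for the
# first time forwards it to all peers, so the message survives even if
# the original sender crashes partway through its broadcast.

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.peers = []             # all processes in the group, set later
        self.delivered = set()      # messages delivered exactly once
        self.crashed = False

    def receive(self, msg):
        if self.crashed or msg in self.delivered:
            return                  # drop duplicates; crashed nodes do nothing
        self.delivered.add(msg)
        for peer in self.peers:     # flood: O(N^2) messages in total
            if peer is not self:
                peer.receive(msg)

def broadcast(sender, msg, targets, crash_after=None):
    """Sender pushes msg to targets; may crash after `crash_after` sends."""
    for i, target in enumerate(targets):
        if crash_after is not None and i >= crash_after:
            sender.crashed = True
            return                  # sender dies before reaching everyone
        target.receive(msg)

procs = [Process(i) for i in range(4)]
for p in procs:
    p.peers = procs

# The sender (procs[0]) crashes after reaching only one peer...
broadcast(procs[0], "m1", procs[1:], crash_after=1)
# ...yet flooding still delivers "m1" to every correct process.
```

Running this, every non-crashed process ends up with `"m1"` in its `delivered` set, even though the sender reached only a single peer before failing.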

Atomic Broadcast

Even though the flooding algorithm just described can ensure message delivery, it does not guarantee delivery in any particular order. Messages reach their destination eventually, at an unknown time. If we need to deliver messages in order, we have to use the atomic broadcast (also called the total order multicast), which guarantees both reliable delivery and total order.

While a reliable broadcast ensures that the processes agree on the set of messages delivered, an atomic broadcast also ensures they agree on the same sequence of messages (i.e., message delivery order is the same for every target).

In summary, an atomic broadcast has to ensure two essential properties:

Atomicity

Processes have to agree on the set of received messages. Either all nonfailed processes deliver the message, or none do.

Order

All nonfailed processes deliver the messages in the same order.

Messages here are delivered atomically: every message is either delivered to all processes or none of them and, if the message is delivered, every other message is ordered before or after this message.

Virtual Synchrony

One of the frameworks for group communication using broadcast is called virtual synchrony. An atomic broadcast helps to deliver totally ordered messages to a static group of processes, and virtual synchrony delivers totally ordered messages to a dynamic group of peers.

Virtual synchrony organizes processes into groups. As long as the group exists, messages are delivered to all of its members in the same order. In this case, the order is not specified by the model, and some implementations can take this to their advantage for performance gains, as long as the order they provide is consistent across all members [BIRMAN10].

Processes have the same view of the group, and messages are associated with the group identity: processes can see the identical messages only as long as they belong to the same group.

As soon as one of the participants joins, leaves the group, or fails and is forced out of it, the group view changes. This happens by announcing the group change to all its members. Each message is uniquely associated with the group it has originated from.

Virtual synchrony distinguishes between the message receipt (when a group member receives the message) and its delivery (which happens when all the group members receive the message). If the message was sent in one view, it can be delivered only in the same view, which can be determined by comparing the current group with the group the message is associated with. Received messages remain pending in the queue until the process is notified about successful delivery.

Since every message belongs to a specific group, unless all processes in the group have received it before the view change, no group member can consider this message delivered. This implies that all messages are sent and delivered between the view changes, which gives us atomic delivery guarantees. In this case, group views serve as a barrier that message broadcasts cannot pass.

Some total order broadcast algorithms order messages by using a single process (a sequencer) that is responsible for determining the message order. Such algorithms can be easier to implement, but rely on detecting leader failures for liveness. Using a sequencer can improve performance, since we do not need to establish consensus between processes for every message, and can use a sequencer-local view instead. This approach can still scale by partitioning the requests.

Despite its technical soundness, virtual synchrony has not received broad adoption and isn’t commonly used in end-user commercial systems [BIRMAN06].

Zookeeper Atomic Broadcast (ZAB)

One of the most popular and widely known implementations of the atomic broadcast is ZAB used by Apache Zookeeper [HUNT10] [JUNQUEIRA11], a hierarchical distributed key-value store, where it’s used to ensure the total order of events and atomic delivery necessary to maintain consistency between the replica states.

Processes in ZAB can take on one of two roles: leader and follower. Leader is a temporary role. It drives the process by executing algorithm steps, broadcasts messages to followers, and establishes the event order. To write new records and execute reads that observe the most recent values, clients connect to one of the nodes in the cluster. If the node happens to be a leader, it will handle the request. Otherwise, it forwards the request to the leader.

To guarantee leader uniqueness, the protocol timeline is split into epochs, each identified by a unique, monotonically increasing sequence number. During any epoch, there can be only one leader. The process starts by finding a prospective leader using any election algorithm, as long as it chooses a process that is up with high probability. Since safety is guaranteed by the subsequent algorithm steps, determining a prospective leader is more of a performance optimization. A prospective leader can also emerge as a consequence of the previous leader’s failure.

As soon as a prospective leader is established, it executes a protocol in three phases:

Discovery

The prospective leader learns about the latest epoch known by every other process, and proposes a new epoch that is greater than the current epoch of any follower. Followers respond to the epoch proposal with the identifier of the latest transaction seen in the previous epoch. After this step, no process will accept broadcast proposals for the earlier epochs.

Synchronization

This phase is used to recover from the previous leader’s failure and bring lagging followers up to speed. The prospective leader sends a message to the followers proposing itself as a leader for the new epoch and collects their acknowledgments. As soon as acknowledgments are received, the leader is established. After this step, followers will not accept attempts to become the epoch leader from any other processes. During synchronization, the new leader ensures that followers have the same history and delivers committed proposals from the established leaders of earlier epochs. These proposals are delivered before any proposal from the new epoch is delivered.

Broadcast

As soon as the followers are back in sync, active messaging starts. During this phase, the leader receives client messages, establishes their order, and broadcasts them to the followers: it sends a new proposal, waits for a quorum of followers to respond with acknowledgments and, finally, commits it. This process is similar to a two-phase commit without aborts: votes are just acknowledgments, and the client cannot vote against a valid leader’s proposal. However, proposals from the leaders from incorrect epochs should not be acknowledged. The broadcast phase continues until the leader crashes, is partitioned from the followers, or is suspected to be crashed due to the message delay.
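The propose/acknowledge/commit exchange of the broadcast phase can be sketched as follows. This is a single-threaded toy with hypothetical class names, not the real implementation: the actual protocol is asynchronous, persists proposals to disk, and handles failures; here the leader merely stamps each message with an `(epoch, counter)` identifier (a zxid, in ZooKeeper terms) and commits once a quorum of followers acknowledges.

```python
# Toy sketch of ZAB's broadcast phase (illustrative names): the leader
# assigns each client message a (epoch, counter) zxid, sends a proposal
# to its followers, and commits once a quorum has acknowledged it.

class Follower:
    def __init__(self, epoch):
        self.epoch = epoch
        self.log = []               # acknowledged proposals, in zxid order

    def on_proposal(self, zxid, value):
        epoch, _ = zxid
        if epoch != self.epoch:     # reject proposals from stale leaders
            return False
        self.log.append((zxid, value))
        return True                 # acknowledgment

class Leader:
    def __init__(self, epoch, followers):
        self.epoch = epoch
        self.counter = 0
        self.followers = followers
        self.committed = []

    def broadcast(self, value):
        self.counter += 1
        zxid = (self.epoch, self.counter)
        acks = sum(f.on_proposal(zxid, value) for f in self.followers)
        # The leader counts itself; commit once a majority acknowledged.
        if acks + 1 > (len(self.followers) + 1) // 2:
            self.committed.append((zxid, value))
            return True
        return False

# Two followers are in epoch 5; one lags behind in epoch 4 and rejects.
followers = [Follower(epoch=5), Follower(epoch=5), Follower(epoch=4)]
leader = Leader(epoch=5, followers=followers)
ok = leader.broadcast("x=1")        # 2 of 3 followers ack: quorum reached
```

Note how the stale follower's rejection does not block progress: a quorum of up-to-date followers is enough to commit.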

Figure 14-2 shows the three phases of the ZAB algorithm, and messages exchanged during each step.

Figure 14-2. ZAB protocol summary

The safety of this protocol is guaranteed if followers ensure they accept proposals only from the leader of the established epoch. Two processes may attempt to get elected, but only one of them can win and establish itself as an epoch leader. It is also assumed that processes perform the prescribed steps in good faith and follow the protocol.

Both the leader and followers rely on heartbeats to determine the liveness of the remote processes. If the leader does not receive heartbeats from the quorum of followers, it steps down as a leader, and restarts the election process. Similarly, if one of the followers has determined the leader crashed, it starts a new election process.

Messages are totally ordered, and the leader will not attempt to send the next message until the message that preceded it has been acknowledged. Even if some messages are received by a follower more than once, their repeated application does not produce additional side effects, as long as the delivery order is followed. ZAB is able to handle multiple outstanding concurrent state changes from clients, since a unique leader will receive write requests, establish the event order, and broadcast the changes.

Total message order also allows ZAB to improve recovery efficiency. During the synchronization phase, followers respond with their highest committed proposal. The leader can simply choose the node with the highest proposal for recovery, and this can be the only node messages have to be copied from.

One of the advantages of ZAB is its efficiency: the broadcast process requires only two rounds of messages, and leader failures can be recovered from by streaming the missing messages from a single up-to-date process. Having a long-lived leader can have a positive impact on performance: we do not require additional consensus rounds to establish a history of events, since the leader can sequence them based on its local view.

Paxos

An atomic broadcast is a problem equivalent to consensus in an asynchronous system with crash failures [CHANDRA96], since participants have to agree on the message order and must be able to learn about it. You will see many similarities in both motivation and implementation between atomic broadcast and consensus algorithms.

Probably the most widely known consensus algorithm is Paxos. It was first introduced by Leslie Lamport in “The Part-Time Parliament” paper [LAMPORT98]. In this paper, consensus is described in terms of terminology inspired by the legislative and voting process on the Aegean island of Paxos. In 2001, the author released a follow-up paper titled “Paxos Made Simple” [LAMPORT01] that introduced simpler terms, which are now commonly used to explain this algorithm.

Participants in Paxos can take one of three roles: proposers, acceptors, or learners:

Proposers

Receive values from clients, create proposals to accept these values, and attempt to collect votes from acceptors.

Acceptors

Vote to accept or reject the values proposed by the proposer. For fault tolerance, the algorithm requires the presence of multiple acceptors, but for liveness, only a quorum (majority) of acceptor votes is required to accept the proposal.

Learners

Take the role of replicas, storing the outcomes of the accepted proposals.

Any participant can take any role, and most implementations colocate them: a single process can simultaneously be a proposer, an acceptor, and a learner.

Every proposal consists of a value, proposed by the client, and a unique monotonically increasing proposal number. This number is then used to ensure a total order of executed operations and establish happened-before/after relationships among them. Proposal numbers are often implemented using an (id, timestamp) pair, where node IDs are also comparable and can be used to break ties for timestamps.
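Since the ordering compares timestamps first and uses node IDs only to break ties, the pair can be encoded timestamp-first so that lexicographic tuple comparison gives the desired total order. A small Python illustration (the node names are made up):

```python
# Proposal numbers encoded as (timestamp, node_id) tuples: Python compares
# tuples lexicographically, so timestamps dominate and the (comparable)
# node ID breaks ties when two proposers pick the same timestamp.

p1 = (100, "node-a")   # timestamp 100, proposed by node-a
p2 = (100, "node-b")   # same timestamp; the higher node ID wins the tie
p3 = (101, "node-a")   # later timestamp dominates regardless of node ID

assert p2 > p1         # tie on timestamp broken by node ID
assert p3 > p2         # higher timestamp wins
```

The same trick yields a total order for any pair of proposal numbers, which is exactly what the algorithm needs to establish happened-before relationships.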

Paxos Algorithm

The Paxos algorithm can be generally split into two phases: voting (or propose phase) and replication. During the voting phase, proposers compete to establish their leadership. During replication, the proposer distributes the value to the acceptors.

The proposer is an initial point of contact for the client. It receives a value that should be decided upon, and attempts to collect votes from the quorum of acceptors. When this is done, acceptors distribute the information about the agreed value to the learners, ratifying the result. Learners increase the replication factor of the value that’s been agreed on.

Only one proposer can collect the majority of votes. Under some circumstances, votes may get split evenly between the proposers, and neither one of them will be able to collect a majority during this round, forcing them to restart. We discuss this and other scenarios of competing proposers in “Failure Scenarios”.

During the propose phase, the proposer sends a Prepare(n) message (where n is a proposal number) to a majority of acceptors and attempts to collect their votes.

When the acceptor receives the prepare request, it has to respond, preserving the following invariants [LAMPORT01]:

  • If this acceptor hasn’t responded to a prepare request with a higher sequence number yet, it promises that it will not accept any proposal with a lower sequence number.

  • If this acceptor has already accepted (received an Accept!(m,vaccepted) message) any other proposal earlier, it responds with a Promise(m, vaccepted) message, notifying the proposer that it has already accepted the proposal with a sequence number m.

  • If this acceptor has already responded to a prepare request with a higher sequence number, it notifies the proposer about the existence of a higher-numbered proposal.

  • An acceptor can respond to more than one prepare request, as long as the later one has a higher sequence number.

During the replication phase, after collecting a majority of votes, the proposer can start the replication, where it commits the proposal by sending acceptors an Accept!(n, v) message with value v and proposal number n. v is the value associated with the highest-numbered proposal among the responses it received from acceptors, or any value of its own if their responses did not contain old accepted proposals.

The acceptor accepts the proposal with a number n, unless during the propose phase it has already responded to Prepare(m), where m is greater than n. If the acceptor rejects the proposal, it notifies the proposer about it by sending the highest sequence number it has seen along with the request to help the proposer catch up [LAMPORT01].
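These invariants condense into a small acceptor state machine. The sketch below uses names of my own choosing and omits details a real implementation needs (in particular, persisting the promised number and accepted value to stable storage before replying, so a crash-recovered acceptor cannot break its promises):

```python
# Minimal single-decree Paxos acceptor (illustrative): never go backwards
# on promised proposal numbers, and report any previously accepted value
# so later proposers are forced to re-propose it.

class Acceptor:
    def __init__(self):
        self.promised = -1          # highest prepare number responded to
        self.accepted = None        # (number, value) of accepted proposal

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            # Promise: include the previously accepted proposal, if any.
            return ("promise", self.accepted)
        return ("reject", self.promised)   # a higher-numbered prepare exists

    def accept(self, n, v):
        if n >= self.promised:      # not superseded by a newer prepare
            self.promised = n
            self.accepted = (n, v)
            return ("accepted", n)
        return ("reject", self.promised)

a = Acceptor()
assert a.prepare(1) == ("promise", None)
assert a.accept(1, "v1") == ("accepted", 1)
# A later proposer with a higher number learns the accepted value...
assert a.prepare(2) == ("promise", (1, "v1"))
# ...while the old proposer's accept is now rejected.
assert a.accept(1, "v1") == ("reject", 2)
```

The `("promise", (1, "v1"))` reply is what forces the second proposer to adopt `v1` instead of its own value, preserving agreement.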

You can see a generalized depiction of a Paxos round in Figure 14-3.

Figure 14-3. Paxos algorithm: normal execution

Once a consensus was reached on the value (in other words, it was accepted by at least one acceptor), future proposers have to decide on the same value to guarantee the agreement. This is why acceptors respond with the latest value they’ve accepted. If no acceptor has seen a previous value, the proposer is free to choose its own value.

A learner has to find out the value that has been decided, which it can know after receiving notification from the majority of acceptors. To let the learner know about the new value as soon as possible, acceptors can notify it about the value as soon as they accept it. If there’s more than one learner, each acceptor will have to notify each learner. One or more learners can be distinguished, in which case they will notify the other learners about accepted values.

In summary, the goal of the first algorithm phase is to establish a leader for the round and understand which value is going to be accepted, allowing the leader to proceed with the second phase: broadcasting the value. For the purpose of the base algorithm, we assume that we have to perform both phases every time we’d like to decide on a value. In practice, we’d like to reduce the number of steps in the algorithm, so we allow the proposer to propose more than one value. We discuss this in more detail later in “Multi-Paxos”.

Quorums in Paxos

Quorums are used to make sure that some of the participants can fail, but we still can proceed as long as we can collect votes from the alive ones. A quorum is the minimum number of votes required for the operation to be performed. This number usually constitutes a majority of participants. The main idea behind quorums is that even if participants fail or happen to be separated by the network partition, there’s at least one participant that acts as an arbiter, ensuring protocol correctness.

Once a sufficient number of participants accept the proposal, the value is guaranteed to be accepted by the protocol, since any two majorities have at least one participant in common.

Paxos guarantees safety in the presence of any number of failures. There’s no configuration that can produce incorrect or inconsistent states since this would contradict the definition of consensus.

Liveness is guaranteed in the presence of f failed processes. For that, the protocol requires 2f + 1 processes in total so that, if f processes happen to fail, there are still f + 1 processes able to proceed. By using quorums, rather than requiring the presence of all processes, Paxos (and other consensus algorithms) guarantee results even when f process failures occur. In “Flexible Paxos”, we talk about quorums in slightly different terms and describe how to build protocols requiring quorum intersection between algorithm steps only.
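The arithmetic above is easy to check directly (a plain illustration; the helper names are mine): with n processes, a majority quorum has ⌊n/2⌋ + 1 members, the system tolerates f = ⌊(n-1)/2⌋ crashes, and any two majorities necessarily intersect.

```python
# Majority-quorum sizing: quorum size, tolerated crash failures, and the
# intersection property that makes any two majorities share a node.

def quorum_size(n):
    """Smallest majority of n processes."""
    return n // 2 + 1

def faults_tolerated(n):
    """f such that n = 2f + 1 (rounded down for even n)."""
    return (n - 1) // 2

assert quorum_size(3) == 2 and faults_tolerated(3) == 1
assert quorum_size(5) == 3 and faults_tolerated(5) == 2
# Any two majorities intersect: 2 * quorum_size(n) > n for every n.
assert all(2 * quorum_size(n) > n for n in range(1, 100))
```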

Tip

It is important to remember that quorums only describe the blocking properties of the system. To guarantee safety, for each step we have to wait for responses from at least a quorum of nodes. We can send proposals and accept commands to more nodes; we just do not have to wait for their responses to proceed. We may send messages to more nodes (some systems use speculative execution: issuing redundant queries that help to achieve the required response count in case of node failures), but to guarantee liveness, we can proceed as soon as we hear from the quorum.

Failure Scenarios

Discussing distributed algorithms gets particularly interesting when failures are discussed. One of the failure scenarios, demonstrating fault tolerance, is when the proposer fails during the second phase, before it is able to broadcast the value to all the acceptors (a similar situation can happen if the proposer is alive but is slow or cannot communicate with some acceptors). In this case, the new proposer may pick up and commit the value, distributing it to the other participants.

Figure 14-4 shows this situation:

  • Proposer P1 goes through the election phase with a proposal number 1, but fails after sending the value V1 to just one acceptor A1.

  • Another proposer P2 starts a new round with a higher proposal number 2, collects a quorum of acceptor responses (A1 and A2 in this case), and proceeds by committing the old value V1, proposed by P1.

Figure 14-4. Paxos failure scenario: proposer failure, deciding on the old value

Since the algorithm state is replicated to multiple nodes, proposer failure does not result in failure to reach a consensus. If the current proposer fails after even a single acceptor A1 has accepted the value, its proposal can be picked by the next proposer. This also implies that all of it may happen without the original proposer knowing about it.

In a client/server application, where the client is connected only to the original proposer, this might lead to situations where the client doesn’t know about the result of the Paxos round execution.1

However, other scenarios are possible, too, as Figure 14-5 shows. For example:

  • P1 has failed just like in the previous example, after sending the value V1 only to A1.

  • The next proposer, P2, starts a new round with a higher proposal number 2, and collects a quorum of acceptor responses, but this time A2 and A3 are first to respond. After collecting a quorum, P2 commits its own value despite the fact that theoretically there’s a different committed value on A1.

Figure 14-5. Paxos failure scenario: proposer failure, deciding on the new value

There’s one more possibility here, shown in Figure 14-6:

  • Proposer P1 fails after only one acceptor A1 accepts the value V1. A1 fails shortly after accepting the proposal, before it can notify the next proposer about its value.

  • Proposer P2, which started the round after P1 failed, does not overlap with A1 and proceeds to commit its value instead.

  • Any proposer that comes after this round that will overlap with A1, will ignore A1’s value and choose a more recent accepted proposal instead.

Figure 14-6. Paxos failure scenario: proposer failure, followed by an acceptor failure

Another failure scenario is when two or more proposers start competing, each trying to get through the propose phase, but keep failing to collect a majority because the other one beat them to it.

While acceptors promise not to accept any proposals with a lower number, they still may respond to multiple prepare requests, as long as the later one has a higher sequence number. When a proposer tries to commit the value, it might find that acceptors have already responded to a prepare request with a higher sequence number. This may lead to multiple proposers constantly retrying and preventing each other from further progress. This problem is usually solved by incorporating a random backoff, which eventually lets one of the proposers proceed while the other one sleeps.
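A common way to implement that backoff is a randomized, exponentially growing delay. The sketch below is illustrative only; the base and cap values are arbitrary choices, not prescribed by Paxos:

```python
# Randomized exponential backoff for dueling proposers (illustrative
# parameters): after each failed round, a proposer sleeps for a random
# interval whose upper bound doubles per attempt, so eventually one
# proposer completes a full round while the other is still asleep.

import random

def backoff_delay(attempt, base=0.05, cap=2.0):
    """Random delay in seconds before retrying round number `attempt`."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)

delays = [backoff_delay(a) for a in range(10)]
```

In a real system the proposer would sleep for `backoff_delay(attempt)` before re-entering the propose phase with a higher proposal number.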

The Paxos algorithm can tolerate acceptor failures, but only if there are still enough acceptors alive to form a majority.

Multi-Paxos

So far we discussed the classic Paxos algorithm, where we pick an arbitrary proposer and attempt to start a Paxos round. One of the problems with this approach is that a propose round is required for each replication round that occurs in the system. Only after the proposer is established for the round, which happens after a majority of acceptors respond with a Promise to the proposer’s Prepare, can it start the replication. To avoid repeating the propose phase and let the proposer reuse its recognized position, we can use Multi-Paxos, which introduces the concept of a leader: a distinguished proposer [LAMPORT01]. This is a crucial addition, significantly improving algorithm efficiency.

Having an established leader, we can skip the propose phase and proceed straight to replication: distributing a value and collecting acceptor acknowledgments.

In the classic Paxos algorithm, reads can be implemented by running a Paxos round that would collect any values from incomplete rounds if they’re present. This has to be done because the last known proposer is not guaranteed to hold the most recent data, since there might have been a different proposer that has modified state without the proposer knowing about it.

A similar situation may occur in Multi-Paxos: we’re trying to perform a read from the known leader after the other leader is already elected, returning stale data, which contradicts the linearizability guarantees of consensus. To avoid that and guarantee that no other process can successfully submit values, some Multi-Paxos implementations use leases. The leader periodically contacts the participants, notifying them that it is still alive, effectively prolonging its lease. Participants have to respond and allow the leader to continue operation, promising that they will not accept proposals from other leaders for the period of the lease [CHANDRA07].

Leases are not a correctness guarantee, but a performance optimization that allows reads from the active leader without collecting a quorum. To guarantee safety, leases rely on the bounded clock synchrony between the participants. If their clocks drift too much and the leader assumes its lease is still valid while other participants think its lease has expired, linearizability cannot be guaranteed.

Multi-Paxos is sometimes described as a replicated log of operations applied to some structure. The algorithm is oblivious to the semantics of this structure and is only concerned with consistently replicating values that will be appended to this log. To preserve the state in case of process crashes, participants keep a durable log of received messages.

To prevent a log from growing indefinitely large, its contents should be applied to the aforementioned structure. After the log contents are synchronized with a primary structure, creating a snapshot, the log can be truncated. Log and state snapshots should be mutually consistent, and snapshot changes should be applied atomically with truncation of the log segment [CHANDRA07].

We can think of single-decree Paxos as a write-once register: we have a slot where we can put a value, and as soon as we’ve written the value there, no subsequent modifications are possible. During the first step, proposers compete for ownership of the register, and during the second phase, one of them writes the value. At the same time, Multi-Paxos can be thought of as an append-only log, consisting of a sequence of such values: we can write one value at a time, all values are strictly ordered, and we cannot modify already written values [RYSTSOV16]. There are examples of consensus algorithms that offer collections of read-modify-write registers and use state sharing rather than replicated state machines, such as Active Disk Paxos [CHOCKLER15] and CASPaxos [RYSTSOV18].

Fast Paxos

We can reduce the number of round-trips by one, compared to the classic Paxos algorithm, by letting any proposer contact acceptors directly rather than going through the leader. For this, we need to increase the quorum size to 2f + 1 (where f is the number of processes allowed to fail), compared to f + 1 in classic Paxos, and a total number of acceptors to 3f + 1 [JUNQUEIRA07]. This optimization is called Fast Paxos [LAMPORT06].
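
The quorum arithmetic can be captured in a small sketch (the helper names are ours):

```python
def classic_paxos_sizes(f):
    """Classic Paxos: 2f + 1 acceptors with majority quorums of f + 1."""
    return {"acceptors": 2 * f + 1, "quorum": f + 1}


def fast_paxos_sizes(f):
    """Fast Paxos: 3f + 1 acceptors with quorums of 2f + 1."""
    return {"acceptors": 3 * f + 1, "quorum": 2 * f + 1}
```

For example, tolerating a single failure (`f = 1`) takes three acceptors in classic Paxos but four in Fast Paxos.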

The classic Paxos algorithm has a condition whereby, during the replication phase, the proposer can pick any value it has collected during the propose phase. Fast Paxos has two types of rounds: classic, where the algorithm proceeds the same way as the classic version, and fast, where it allows acceptors to accept other values.

While describing this algorithm, we will refer to the proposer that has collected a sufficient number of responses during the propose phase as a coordinator, and reserve the term proposer for all other proposers. Some Fast Paxos descriptions say that clients can contact acceptors directly [ZHAO15].

In a fast round, if the coordinator is permitted to pick its own value during the replication phase, it can instead issue a special Any message to acceptors. Acceptors, in this case, are allowed to treat any proposer’s value as if it is a classic round and they received a message with this value from the coordinator. In other words, acceptors independently decide on values they receive from different proposers.

Figure 14-7 shows an example of classic and fast rounds in Fast Paxos. From the image it might look like the fast round has more execution steps, but keep in mind that in a classic round, in order to submit its value, the proposer would need to go through the coordinator to get its value committed.

Figure 14-7. Fast Paxos algorithm: fast and classic rounds

This algorithm is prone to collisions, which occur if two or more proposers attempt to use the fast step and reduce the number of round-trips, and acceptors receive different values. The coordinator has to intervene and start recovery by initiating a new round.

This means that acceptors, after receiving values from different proposers, may decide on conflicting values. When the coordinator detects a conflict (value collision), it has to reinitiate a Propose phase to let acceptors converge to a single value.

One of the disadvantages of Fast Paxos is the increased number of round-trips and request latency on collisions if the request rate is high. [JUNQUEIRA07] shows that, due to the increased number of replicas and, subsequently, messages exchanged between the participants, despite a reduced number of steps, Fast Paxos can have higher latencies than its classic counterpart.

Egalitarian Paxos

Using a distinguished proposer as a leader makes a system prone to failures: as soon as the leader fails, the system has to elect a new one before it can proceed with further steps. Another problem is that having a leader can put a disproportionate load on it, impairing system performance.

Note

One of the ways to avoid putting an entire system load on the leader is partitioning. Many systems split the range of possible values into smaller segments and allow a part of the system to be responsible for a specific range without having to worry about the other parts. This helps with availability (by isolating failures to a single partition and preventing propagation to other parts of the system), performance (since segments serving different values are nonoverlapping), and scalability (since we can scale the system by increasing the number of partitions). It is important to keep in mind that performing an operation against multiple partitions will require an atomic commitment.

Instead of using a leader and proposal numbers for sequencing commands, we can use a leader responsible for the commit of the specific command, and establish the order by looking up and setting dependencies. This approach is commonly called Egalitarian Paxos, or EPaxos [MORARU11]. The idea of allowing nonconflicting writes to be committed to the replicated state machine independently was first introduced in [LAMPORT05] and called Generalized Paxos. EPaxos was the first implementation of Generalized Paxos.

EPaxos attempts to offer benefits of both the classic Paxos algorithm and Multi-Paxos. Classic Paxos offers high availability, since a leader is established during each round, but has a higher message complexity. Multi-Paxos offers high throughput and requires fewer messages, but a leader may become a bottleneck.

EPaxos starts with a Pre-Accept phase, during which a process becomes a leader for the specific proposal. Every proposal has to include:

Dependencies

All commands that potentially interfere with a current proposal, but are not necessarily already committed.

A sequence number

This breaks cycles between the dependencies. Set it with a value larger than any sequence number of the known dependencies.

After collecting this information, it forwards a Pre-Accept message to a fast quorum of replicas. A fast quorum is ⌈3f/4⌉ replicas, where f is the number of tolerated failures.

Replicas check their local command logs, update the proposal dependencies based on their view of potentially conflicting proposals, and send this information back to the leader. If the leader receives responses from a fast quorum of replicas, and their dependency lists are in agreement with each other and the leader itself, it can commit the command.

If the leader does not receive enough responses or if the command lists received from the replicas differ and contain interfering commands, it updates its proposal with a new dependency list and a sequence number. The new dependency list is based on previous replica responses and combines all collected dependencies. The new sequence number has to be larger than the highest sequence number seen by the replicas. After that, the leader sends the new, updated command to ⌊f/2⌋ + 1 replicas. After this is done, the leader can finally commit the proposal.
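
A simplified sketch of the leader's handling of Pre-Accept replies, assuming each reply carries a dependency set and a sequence number (the function name and data layout are our assumptions):

```python
def handle_preaccept_replies(own_deps, own_seq, replies):
    """Leader-side handling of Pre-Accept replies (simplified sketch).

    replies: list of (deps, seq) pairs from a fast quorum of replicas,
    where deps is a set of interfering commands and seq a sequence number.
    Returns (deps, seq, fast_path).
    """
    # Fast path: every replica agrees with the leader's dependency list.
    if all(deps == own_deps for deps, _ in replies):
        return own_deps, own_seq, True
    # Slow path: combine all collected dependencies, pick a sequence number
    # larger than any seen by the replicas, and run an Accept round.
    merged = set(own_deps)
    for deps, _ in replies:
        merged |= deps
    new_seq = max([own_seq] + [seq for _, seq in replies]) + 1
    return merged, new_seq, False
```

On the slow path the returned proposal would then be sent to ⌊f/2⌋ + 1 replicas before committing.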

Effectively, we have two possible scenarios:

Fast path

When dependencies match and the leader can safely proceed with the commit phase with only a fast quorum of replicas.

Slow path

When there’s a disagreement between the replicas, and their command lists have to be updated before the leader can proceed with a commit.

Figure 14-8 shows these scenarios—P1 initiating a fast path run, and P5 initiating a slow path run:

  • P1 starts with proposal number 1 and no dependencies, and sends a PreAccept(1, ∅) message. Since the command logs of P2 and P3 are empty, P1 can proceed with a commit.

  • P5 creates a proposal with sequence number 2. Since its command log is empty by that point, it also declares no dependencies and sends a PreAccept(2, ∅) message. P4 is not aware of the committed proposal 1, but P3 notifies P5 about the conflict and sends its command log: {1}.

  • P5 updates its local dependency list and sends a message to make sure replicas have the same dependencies: Accept(2,{1}). As soon as the replicas respond, it can commit the value.

Figure 14-8. An EPaxos algorithm run

Two commands, A and B, interfere only if their execution order matters; in other words, if executing A before B and executing B before A produce different results.

Commit is done by responding to the client and asynchronously notifying replicas with a Commit message. Commands are executed after they’re committed.

Since dependencies are collected during the Pre-Accept phase, by the time requests are executed, the command order is already established and no command can suddenly appear somewhere in-between: it can only get appended after the command with the largest sequence number.

To execute a command, replicas build a dependency graph and execute all commands in a reverse dependency order. In other words, before a command can be executed, all its dependencies (and, subsequently, all their dependencies) have to be executed. Since only interfering commands have to depend on each other, this situation should be relatively rare for most workloads [MORARU13].
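
The execution rule can be sketched as a depth-first traversal of the dependency graph. This hypothetical helper assumes the graph is already acyclic, whereas real EPaxos breaks cycles using sequence numbers:

```python
def execution_order(dependencies):
    """Order commands so that every command runs after all its dependencies.

    dependencies maps a command to the set of commands it depends on.
    Assumes an acyclic graph (EPaxos uses sequence numbers to break cycles).
    """
    order, done = [], set()

    def visit(cmd):
        if cmd in done:
            return
        done.add(cmd)
        # Execute all dependencies (and their dependencies) first.
        for dep in sorted(dependencies.get(cmd, ())):
            visit(dep)
        order.append(cmd)

    for cmd in sorted(dependencies):
        visit(cmd)
    return order
```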

Similar to Paxos, EPaxos uses proposal numbers, which prevent stale messages from being propagated. Sequence numbers consist of an epoch (identifier of the current cluster configuration that changes when nodes leave and join the cluster), a monotonically incremented node-local counter, and a replica ID. If a replica receives a proposal with a sequence number lower than one it has already seen, it negatively acknowledges the proposal, and sends the highest sequence number and an updated command list known to it in response.

Flexible Paxos

A quorum is usually defined as a majority of processes. By definition, we have an intersection between two quorums no matter how we pick nodes: there’s always at least one node that can break ties.

We have to answer two important questions:

  • Is it necessary to contact the majority of servers during every execution step?

  • Do all quorums have to intersect? In other words, does a quorum we use to pick a distinguished proposer (first phase), a quorum we use to decide on a value (second phase), and every execution instance (for example, if multiple instances of the second step are executed concurrently), have to have nodes in common?

Since we’re still talking about consensus, we cannot change any safety definitions: the algorithm has to guarantee the agreement.

In Multi-Paxos, the leader election phase is infrequent, and the distinguished proposer is allowed to commit several values without rerunning the election phase, potentially staying in the lead for a longer period. In “Tunable Consistency”, we discussed formulae that help us to find configurations where we have intersections between the node sets. One of the examples was to wait for just one node to acknowledge the write (and let the requests to the rest of the nodes finish asynchronously), and read from all the nodes. In other words, as long as we keep R + W > N, there’s at least one node in common between read and write sets.

Can we use a similar logic for consensus? It turns out that we can, and in Paxos we only require the group of nodes from the first phase (that elects a leader) to overlap with the group from the second phase (that participates in accepting proposals).

In other words, a quorum doesn’t have to be defined as a majority, but only as a non-empty group of nodes. If we define a total number of participants as N, the number of nodes required for a propose phase to succeed as Q₁, and the number of nodes required for the accept phase to succeed as Q₂, we only need to ensure that Q₁ + Q₂ > N. Since the second phase is usually more common than the first one, Q₂ can contain only N/2 acceptors, as long as Q₁ is adjusted to be correspondingly larger (Q₁ = N - Q₂ + 1). This is an important observation, crucial for understanding consensus. The algorithm that uses this approach is called Flexible Paxos [HOWARD16].
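
The Q₁ + Q₂ > N condition and the resulting fault tolerance can be checked with a couple of small helpers (names are ours):

```python
def quorums_intersect(n, q1, q2):
    """Flexible Paxos safety condition: the propose-phase and accept-phase
    quorums must share at least one node."""
    return q1 + q2 > n


def replication_fault_tolerance(n, q2):
    """With a stable leader, the accept phase tolerates N - Q2 failures."""
    return n - q2
```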

For example, if we have five acceptors, as long as we require collecting votes from four of them to win the election round, we can allow the leader to wait for responses from two nodes during the replication stage. Moreover, since there’s an overlap between any subset consisting of two acceptors with the leader election quorum, we can submit proposals to disjoint sets of acceptors. Intuitively, this works because whenever a new leader is elected without the current one being aware of it, there will always be at least one acceptor that knows about the existence of the new leader.

Flexible Paxos allows trading availability for latency: we reduce the number of nodes participating in the second phase but have to collect more votes, requiring more participants to be available during the leader election phase. The good news is that this configuration can continue the replication phase and tolerate failures of up to N - Q₂ nodes, as long as the current leader is stable and a new election round is not required.

Another Paxos variant using the idea of intersecting quorums is Vertical Paxos. Vertical Paxos distinguishes between read and write quorums. These quorums must intersect. A leader has to collect a smaller read quorum for one or more lower-numbered proposals, and a larger write quorum for its own proposal [LAMPORT09]. [LAMPSON01] also distinguishes between the out and decision quorums, which translate to prepare and accept phases, and gives a quorum definition similar to Flexible Paxos.

Generalized Solution to Consensus

Paxos might sometimes be a bit difficult to reason about: multiple roles, steps, and all the possible variations are hard to keep track of. But we can think of it in simpler terms. Instead of splitting roles between the participants and having decision rounds, we can use a simple set of concepts and rules to achieve guarantees of a single-decree Paxos. We discuss this approach only briefly as this is a relatively new development [HOWARD19]—it’s important to know, but we’ve yet to see its implementations and practical applications.

We have a client and a set of servers. Each server has multiple registers. A register has an index identifying it, can be written only once, and it can be in one of three states: unwritten, containing a value, and containing nil (a special empty value).

Registers with the same index located on different servers form a register set. Each register set can have one or more quorums. Depending on the state of the registers in it, a quorum can be in one of the undecided (Any and Maybe v), or decided (None and Decided v) states:

Any

Depending on future operations, this quorum set can decide on any value.

Maybe v

If this quorum reaches a decision, its decision can only be v.

None

This quorum cannot decide on the value.

Decided v

This quorum has decided on the value v.
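
These four states can be derived from register contents. A minimal sketch, assuming, as the writing rules guarantee, that all written values within a register set agree (the constant and function names are ours):

```python
UNWRITTEN = "unwritten"
NIL = "nil"  # the special empty value


def quorum_state(registers):
    """Classify one quorum of a register set (simplified sketch).

    registers holds the quorum's register contents: UNWRITTEN, NIL,
    or a ("value", v) tuple.
    """
    values = [r[1] for r in registers if isinstance(r, tuple)]
    if not values and all(r == UNWRITTEN for r in registers):
        return ("Any", None)        # could still decide any value
    if NIL in registers:
        return ("None", None)       # a nil register blocks a decision
    if len(values) == len(registers):
        return ("Decided", values[0])
    return ("Maybe", values[0])     # can only ever decide v
```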

The client exchanges messages with the servers and maintains a state table, where it keeps track of values and registers, and can infer decisions made by the quorums.

To maintain correctness, we have to limit how clients can interact with servers and which values they may write and which they may not. In terms of reading values, the client can output the decided value only if it has read it from the quorum of servers in the same register set.

The writing rules are slightly more involved because to guarantee algorithm safety, we have to preserve several invariants. First, we have to make sure that the client doesn’t just come up with new values: it is allowed to write a specific value to the register only if it has received it as input or has read it from a register. Clients cannot write values that allow different quorums in the same register to decide on different values. Lastly, clients cannot write values that override previous decisions made in the previous register sets (decisions made in register sets up to r - 1 have to be None, Maybe v, or Decided v).

Generalized Paxos algorithm

Putting all these rules together, we can implement a generalized Paxos algorithm that achieves consensus over a single value using write-once registers [HOWARD19]. Let’s say we have three servers [S0, S1, S2], registers [R0, R1, …], and clients [C0, C1, ...], where the client can only write to the assigned subset of registers. We use simple majority quorums for all registers ({S0, S1}, {S0, S2}, {S1, S2}).

The decision process here consists of two phases. The first phase ensures that it is safe to write a value to the register, and the second phase writes the value to the register:

During phase 1

The client checks if the register it is about to write is unwritten by sending a P1A(register) command to the server. If the register is unwritten, all registers up to register - 1 are set to nil, which prevents clients from writing to previous registers. Each server responds with the set of registers written so far. If the client receives responses from a majority of servers, it chooses either the nonempty value from the register with the largest index or its own value in case no value is present. Otherwise, it restarts the first phase.

During phase 2

The client notifies all servers about the value it has picked during the first phase by sending them P2A(register, value). If the majority of servers respond to this message, it can output the decision value. Otherwise, it starts again from phase 1.
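
The phase 1 value-selection rule might look as follows. This is a sketch; the data layout (per-server maps of register index to written value, with nil entries already dropped) is our assumption:

```python
def choose_value(register_states, own_value):
    """Phase 1 decision rule: pick the nonempty value from the written
    register with the largest index, or our own value if none exists.

    register_states: per-server mappings of register index to value,
    collected from a majority of servers.
    """
    written = {}
    for state in register_states:
        written.update(state)
    if not written:
        return own_value  # no register written so far: free to propose our own
    return written[max(written)]
```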

Figure 14-9 shows this generalization of Paxos (adapted from [HOWARD19]). Client C0 tries to commit value V. During the first step, its state table is empty, and servers S0 and S1 respond with the empty register set, indicating that no registers were written so far. During the second step, it can submit its value V, since no other value was written.

Figure 14-9. Generalization of Paxos

At that point, any other client can query servers to find out the current state. Quorum {S0, S1} has reached the Decided V state, and quorums {S0, S2} and {S1, S2} have reached the Maybe V state for R0, so C1 chooses the value V. At that point, no client can decide on a value other than V.

This approach helps to understand the semantics of Paxos. Instead of thinking about the state from the perspective of interactions of remote actors (e.g., a proposer finding out whether or not an acceptor has already accepted a different proposal), we can think in terms of the last known state, making our decision process simple and removing possible ambiguities. Immutable state and message passing can also be easier to implement correctly.

We can also draw parallels with original Paxos. For example, in a scenario in which the client finds that one of the previous register sets has the Maybe V decision, it picks up V and attempts to commit it again, which is similar to how a proposer in Paxos can propose the value after the failure of the previous proposer that was able to commit the value to at least one acceptor. Similarly, if in Paxos leader conflicts are resolved by restarting the vote with a higher proposal number, in the generalized algorithm any unwritten lower-ranked registers are set to nil.

Raft

Paxos was the consensus algorithm for over a decade, but in the distributed systems community it’s been known as difficult to reason about. In 2013, a new algorithm called Raft appeared. The researchers who developed it wanted to create an algorithm that’s easy to understand and implement. It was first presented in a paper titled “In Search of an Understandable Consensus Algorithm” [ONGARO14].

There’s enough inherent complexity in distributed systems, and having simpler algorithms is very desirable. Along with a paper, the authors have released a reference implementation called LogCabin to resolve possible ambiguities and help future implementors to gain a better understanding.

Locally, participants store a log containing the sequence of commands executed by the state machine. Since inputs that processes receive are identical and logs contain the same commands in the same order, applying these commands to the state machine guarantees the same output. Raft simplifies consensus by making the concept of leader a first-class citizen. A leader is used to coordinate state machine manipulation and replication. There are many similarities between Raft and atomic broadcast algorithms, as well as Multi-Paxos: a single leader emerges from replicas, makes atomic decisions, and establishes the message order.

Each participant in Raft can take one of three roles:

Candidate

Leadership is a temporary condition, and any participant can take this role. To become a leader, the node first has to transition into a candidate state, and attempt to collect a majority of votes. If a candidate neither wins nor loses the election (the vote is split between multiple candidates and none of them has a majority of votes), a new term begins and the election restarts.

Leader

A current, temporary cluster leader that handles client requests and interacts with a replicated state machine. The leader is elected for a period called a term. Each term is identified by a monotonically increasing number and may continue for an arbitrary time period. A new leader is elected if the current one crashes, becomes unresponsive, or is suspected by other processes to have failed, which can happen because of network partitions and message delays.

Follower

A passive participant that persists log entries and responds to requests from the leader and candidates. Follower in Raft is a role similar to acceptor and learner from Paxos. Every process begins as a follower.

To guarantee global partial ordering without relying on clock synchronization, time is divided into terms (also called epochs), during which the leader is unique and stable. Terms are monotonically numbered, and each command is uniquely identified by the term number and the message number within the term [HOWARD14].

It may happen that different participants disagree on which term is current, since they can find out about the new term at different times, or could have missed the leader election for one or multiple terms. Since each message contains a term identifier, if one of the participants discovers that its term is out-of-date, it updates the term to the higher-numbered one [ONGARO14]. This means that there may be several terms in flight at any given point in time, but the higher-numbered one wins in case of a conflict. A node updates the term only if it starts a new election process or finds out that its term is out-of-date.

On startup, or whenever a follower doesn’t receive messages from the leader and suspects that it has crashed, it starts the leader election process. A participant attempts to become a leader by transitioning into the candidate state and collecting votes from the majority of nodes.

Figure 14-10 shows a sequence diagram representing the main components of the Raft algorithm:

Leader election

Candidate P1 sends a RequestVote message to the other processes. This message includes the candidate’s term, the last term known by it, and the ID of the last log entry it has observed. After collecting a majority of votes, the candidate is successfully elected as a leader for the term. Each process can give its vote to at most one candidate.

Periodic heartbeats

The protocol uses a heartbeat mechanism to ensure the liveness of participants. The leader periodically sends heartbeats to all followers to maintain its term. If a follower doesn’t receive new heartbeats for a period called an election timeout, it assumes that the leader has failed and starts a new election.

Log replication / broadcast

The leader can repeatedly append new values to the replicated log by sending AppendEntries messages. The message includes the leader’s term, the index and term of the log entry that immediately precedes the ones it’s currently sending, and one or more entries to store.

Figure 14-10. Raft consensus algorithm summary

Leader Role in Raft

A leader can be elected only from the nodes holding all committed entries: if during the election, the follower’s log information is more up-to-date (in other words, has a higher term ID, or a longer log entry sequence, if terms are equal) than the candidate’s, its vote is denied.

To win the vote, a candidate has to collect a majority of votes. Entries are always replicated in order, so it is always enough to compare IDs of the latest entries to understand whether or not one of the participants is up-to-date.
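
The up-to-date check a follower performs before granting its vote can be sketched as (function name ours):

```python
def log_is_up_to_date(candidate_last_term, candidate_last_index,
                      voter_last_term, voter_last_index):
    """A follower grants its vote only if the candidate's log is at least
    as up-to-date as its own: a higher last term wins; with equal terms,
    the longer log wins."""
    if candidate_last_term != voter_last_term:
        return candidate_last_term > voter_last_term
    return candidate_last_index >= voter_last_index
```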

Once elected, the leader has to accept client requests (which can also be forwarded to it from other nodes) and replicate them to the followers. This is done by appending the entry to its log and sending it to all the followers in parallel.

When a follower receives an AppendEntries message, it appends the entries from the message to the local log, and acknowledges the message, letting the leader know that it was persisted. As soon as enough replicas send their acknowledgments, the entry is considered committed and is marked correspondingly in the leader log.
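
The leader's commit rule amounts to finding the highest index stored on a majority of nodes. A simplified sketch (real Raft additionally requires the entry at that index to belong to the leader's current term):

```python
def highest_committed_index(match_indexes):
    """Index of the latest entry replicated on a majority of nodes.

    match_indexes: for every node (leader included), the highest log
    index known to be persisted on it.
    """
    majority = len(match_indexes) // 2 + 1
    # Sort descending; the majority-th entry is on at least that many nodes.
    return sorted(match_indexes, reverse=True)[majority - 1]
```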

Since only the most up-to-date candidates can become a leader, followers never have to bring the leader up-to-date, and log entries are only flowing from leader to follower and not vice versa.

Figure 14-11 shows this process:

  • a) A new command x = 8 is appended to the leader’s log.

  • b) Before the value can be committed, it has to be replicated to the majority of participants.

  • c) As soon as the leader is done with replication, it commits the value locally.

  • d) The commit decision is replicated to the followers.

Figure 14-11. Procedure of a commit in Raft, with P1 as a leader

Figure 14-12 shows an example of a consensus round where P₁ is a leader, which has the most recent view of the events. The leader proceeds by replicating the entries to the followers, and committing them after collecting acknowledgments. Committing an entry also commits all entries preceding it in the log. Only the leader can make a decision on whether or not the entry can be committed. Each log entry is marked with a term ID (a number in the top-right corner of each log entry box) and a log index, identifying its position in the log. Committed entries are guaranteed to be replicated to the quorum of participants and are safe to be applied to the state machine in the order they appear in the log.

Figure 14-12. Raft state machine

Failure Scenarios

When multiple followers decide to become candidates, and no candidate can collect a majority of votes, the situation is called a split vote. Raft uses randomized timers to reduce the probability of multiple subsequent elections ending up in a split vote. One of the candidates can start the next election round earlier and collect enough votes, while the others sleep and give way to it. This approach speeds up the election without requiring any additional coordination between candidates.
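The randomized-timer idea can be sketched in a few lines; the interval bounds below are illustrative (real implementations pick their own ranges).

```python
import random

def election_timeout(base_ms=150, spread_ms=150, rng=random):
    """Each node independently draws its election timeout from a
    uniform interval, so simultaneous candidacies become unlikely."""
    return base_ms + rng.uniform(0, spread_ms)

# The node that draws the shortest timeout usually starts the next
# election round first and collects votes while the others still wait.
timeouts = {node: election_timeout() for node in ("n1", "n2", "n3")}
first_candidate = min(timeouts, key=timeouts.get)
```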

Followers may be down or slow to respond, and the leader has to make the best effort to ensure message delivery. It can try sending messages again if it doesn’t receive an acknowledgment within the expected time bounds. As a performance optimization, it can send multiple messages in parallel.

Since entries replicated by the leader are uniquely identified, repeated message delivery is guaranteed not to break the log order. Followers deduplicate messages using their sequence IDs, ensuring that double delivery has no undesired side effects.

Sequence IDs are also used to ensure the log ordering. A follower rejects a higher-numbered entry if the ID and term of the entry that immediately precedes it, sent by the leader, do not match the highest entry according to its own records. If entries in two logs on different replicas have the same term and the same index, they store the same command and all entries that precede them are the same.
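The consistency check and the overwrite-on-conflict behavior can be sketched together. This is a simplified, illustrative sketch with 0-based indices, not a complete AppendEntries handler: term checks, commit indices, and persistence are omitted.

```python
def append_entries(log, prev_index, prev_term, entries):
    """log is a list of (term, command) pairs. The follower accepts new
    entries only if its log contains the entry immediately preceding
    them, with a matching term. Returns True if accepted."""
    if prev_index >= 0:
        if prev_index >= len(log) or log[prev_index][0] != prev_term:
            # No common ground yet: reject, so the leader retries with
            # an earlier prev_index until the logs agree.
            return False
    # Discard any conflicting (uncommitted) suffix, then append the
    # leader's entries, overwriting the follower's divergent history.
    del log[prev_index + 1:]
    log.extend(entries)
    return True
```

Repeated delivery of the same AppendEntries message is harmless here: the check and the deterministic truncate-and-append leave the log in the same state.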

Raft guarantees to never show an uncommitted message as a committed one, but, due to network or replica slowness, already committed messages can still be seen as in progress, which is a rather harmless property and can be worked around by retrying a client command until it is finally committed [HOWARD14].

For failure detection, the leader has to send heartbeats to the followers. This way, the leader maintains its term. When one of the nodes notices that the current leader is down, it attempts to initiate the election. The newly elected leader has to restore the state of the cluster to the last known up-to-date log entry. It does so by finding a common ground (the highest log entry on which both the leader and follower agree), and ordering followers to discard all (uncommitted) entries appended after this point. It then sends the most recent entries from its log, overwriting the followers’ history. The leader’s own log records are never removed or overwritten: it can only append entries to its own log.

Summing up, the Raft algorithm provides the following guarantees:

  • Only one leader can be elected at a time for a given term; no two leaders can be active during the same term.

  • The leader does not remove or reorder its log contents; it only appends new messages to it.

  • Committed log entries are guaranteed to be present in logs for subsequent leaders and cannot get reverted, since before the entry is committed it is known to be replicated by the leader.

  • All messages are identified uniquely by the message and term IDs; neither current nor subsequent leaders can reuse the same identifier for the different entry.

Since its appearance, Raft has become very popular and is currently used in many databases and other distributed systems, including CockroachDB, Etcd, and Consul. This can be attributed to its simplicity, but also may mean that Raft lives up to the promise of being a reliable consensus algorithm.

Byzantine Consensus

All the consensus algorithms we have been discussing so far assume non-Byzantine failures (see “Arbitrary Faults”). In other words, nodes execute the algorithm in “good faith” and do not try to exploit it or forge the results.

As we will see, this assumption allows achieving consensus with a smaller number of available participants and with fewer round-trips required for a commit. However, distributed systems are sometimes deployed in potentially adversarial environments, where the nodes are not controlled by the same entity, and we need algorithms that can ensure a system can function correctly even if some nodes behave erratically or even maliciously. Besides ill intentions, Byzantine failures can also be caused by bugs, misconfiguration, hardware issues, or data corruption.

Most Byzantine consensus algorithms require N² messages to complete an algorithm step, where N is the size of the quorum, since each node in the quorum has to communicate with each other. This is required to cross-validate each step against other nodes, since nodes cannot rely on each other or on the leader and have to verify other nodes’ behaviors by comparing returned results with the majority responses.

We’ll only discuss one Byzantine consensus algorithm here, Practical Byzantine Fault Tolerance (PBFT) [CASTRO99]. PBFT assumes independent node failures (i.e., failures can be coordinated, but the entire system cannot be taken over at once, or at least with the same exploit method). The system makes weak synchrony assumptions, like how you would expect a network to behave normally: failures may occur, but they are not indefinite and are eventually recovered from.

All communication between the nodes is encrypted, which serves to prevent message forging and network attacks. Replicas know one another’s public keys to verify identities and encrypt messages. Faulty nodes may leak information from inside the system, since, even though encryption is used, every node needs to interpret message contents to react upon them. This doesn’t undermine the algorithm, since it serves a different purpose.

PBFT Algorithm

For PBFT to guarantee both safety and liveness, no more than (n - 1)/3 replicas can be faulty (where n is the total number of participants). For a system to sustain f compromised nodes, it is required to have at least n = 3f + 1 nodes. This is the case because a majority of nodes have to agree on the value: f replicas might be faulty, and there might be f replicas that are not responding but may not be faulty (for example, due to a network partition, power failure, or maintenance). The algorithm has to be able to collect enough responses from nonfaulty replicas to still outnumber those from the faulty ones.
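This fault-tolerance arithmetic can be captured directly. A small sketch; the function name and returned keys are illustrative:

```python
def pbft_parameters(n):
    """Given n = 3f + 1 replicas, derive the quorum sizes PBFT uses:
    tolerate up to f Byzantine replicas, require 2f + 1 matching Commit
    messages, and let the client wait for f + 1 matching replies."""
    f = (n - 1) // 3
    return {
        "f": f,                       # maximum number of faulty replicas
        "commit_quorum": 2 * f + 1,   # matching Commits needed to commit
        "client_quorum": f + 1,       # matching replies the client needs
    }
```

For instance, a four-node cluster tolerates one faulty replica and needs three matching Commit messages per request.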

Consensus properties for PBFT are similar to those of other consensus algorithms: all nonfaulty replicas have to agree both on the set of received values and their order, despite the possible failures.

To distinguish between cluster configurations, PBFT uses views. In each view, one of the replicas is a primary and the rest of them are considered backups. All nodes are numbered consecutively, and the index of the primary node is v mod N, where v is the view ID, and N is the number of nodes in the current configuration. The view can change in cases when the primary fails. Clients execute their operations against the primary. The primary broadcasts the requests to the backups, which execute the requests and send a response back to the client. The client waits for f + 1 replicas to respond with the same result for any operation to succeed.
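The primary-selection rule is a one-line computation, so a view change rotates the primary deterministically among the consecutively numbered replicas (sketch; the name is illustrative):

```python
def primary_for_view(v, n):
    """Index of the primary replica for view v among n replicas."""
    return v % n
```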

After the primary receives a client request, protocol execution proceeds in three phases:

Pre-prepare

The primary broadcasts a message containing a view ID, a unique monotonically increasing identifier, a payload (client request), and a payload digest. Digests are computed using a strong collision-resistant hash function, and are signed by the sender. The backup accepts the message if its view matches with the primary view and the client request hasn’t been tampered with: the calculated payload digest matches the received one.

Prepare

If the backup accepts the pre-prepare message, it enters the prepare phase and starts broadcasting Prepare messages, containing a view ID, message ID, and a payload digest, but without the payload itself, to all other replicas (including the primary). Replicas can move past the prepare state only if they receive 2f prepares from different backups that match the message received during pre-prepare: they have to have the same view, same ID, and a digest.

Commit

After that, the backup moves to the commit phase, where it broadcasts Commit messages to all other replicas and waits to collect 2f + 1 matching Commit messages (possibly including its own) from the other participants.

A digest in this case is used to reduce the message size during the prepare phase, since it’s not necessary to rebroadcast an entire payload for verification, as the digest serves as a payload summary. Cryptographic hash functions are resistant to collisions: it is difficult to produce two values that have the same digest, let alone two messages with matching digests that make sense in the context of the system. In addition, digests are signed to make sure that the digest itself is coming from a trusted source.

The number 2f is important, since the algorithm has to make sure that at least f + 1 nonfaulty replicas respond to the client.
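The phase transitions of a single backup can be sketched as counters over matching messages. This is an illustrative sketch only: transport, digests, signatures, and view changes are omitted, and `BackupSlot` is a hypothetical name.

```python
class BackupSlot:
    """Tracks one client request through PBFT's prepare/commit phases
    on a backup replica, for a system tolerating f faulty nodes."""

    def __init__(self, f):
        self.f = f
        self.prepares = set()   # replica ids whose Prepare matched
        self.commits = set()    # replica ids whose Commit matched
        self.prepared = False
        self.committed = False

    def on_prepare(self, replica_id):
        self.prepares.add(replica_id)
        # Prepared once 2f matching Prepares from distinct replicas
        # agree with the pre-prepare message.
        if len(self.prepares) >= 2 * self.f:
            self.prepared = True

    def on_commit(self, replica_id):
        self.commits.add(replica_id)
        # Committed after 2f + 1 matching Commits (possibly including
        # the replica's own).
        if self.prepared and len(self.commits) >= 2 * self.f + 1:
            self.committed = True
```

Because messages are counted per distinct replica and cross-checked for equality, fewer than the required number of honest peers can never push a slot into the prepared or committed state.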

Figure 14-13 shows a sequence diagram of a normal-case PBFT algorithm round: the client sends a request to P1, and nodes move between phases by collecting a sufficient number of matching responses from properly behaving peers. P4 may have failed or could’ve responded with unmatching messages, so its responses wouldn’t have been counted.

Figure 14-13. PBFT consensus, normal-case operation

During the prepare and commit phases, nodes communicate by sending messages to each other node and waiting for the messages from the corresponding number of other nodes, to check if they match and make sure that incorrect messages are not broadcasted. Peers cross-validate all messages so that only nonfaulty nodes can successfully commit messages. If a sufficient number of matching messages cannot be collected, the node doesn’t move to the next step.

When replicas collect enough commit messages, they notify the client, finishing the round. The client cannot be certain about whether or not execution was fulfilled correctly until it receives f + 1 matching responses.

View changes occur when replicas notice that the primary is inactive, and suspect that it might have failed. Nodes that detect a primary failure stop responding to further messages (apart from checkpoint and view-change related ones), broadcast a view change notification, and wait for confirmations. When the primary of the new view receives 2f view change events, it initiates a new view.

To reduce the number of messages in the protocol, clients can collect 2f + 1 matching responses from nodes that tentatively execute a request (e.g., after they’ve collected a sufficient number of matching Prepared messages). If the client cannot collect enough matching tentative responses, it retries and waits for f + 1 nontentative responses as described previously.

Read-only operations in PBFT can be done in just one round-trip. The client sends a read request to all replicas. Replicas execute the request in their tentative states, after all ongoing state changes to the read value are committed, and respond to the client. After collecting 2f + 1 responses with the same value from different replicas, the operation completes.

Recovery and Checkpointing

Replicas save accepted messages in a stable log. Every message has to be kept until it has been executed by at least f + 1 nodes. This log can be used to get other replicas up to speed in case of a network partition, but recovering replicas need some means of verifying that the state they receive is correct, since otherwise recovery can be used as an attack vector.

To show that the state is correct, nodes compute a digest of the state for messages up to a given sequence number. Nodes can compare digests, verify state integrity, and make sure that messages they received during recovery add up to a correct final state. This process is too expensive to perform on every request.

After every N requests, where N is a configurable constant, the primary makes a stable checkpoint, where it broadcasts the latest sequence number of the latest request whose execution is reflected in the state, and the digest of this state. It then waits for 2f + 1 replicas to respond. These responses constitute a proof for this checkpoint, and a guarantee that replicas can safely discard state for all pre-prepare, prepare, commit, and checkpoint messages up to the given sequence number.

Byzantine fault tolerance is essential to understand and is used in storage systems deployed in potentially adversarial networks. Most of the time, it is enough to authenticate and encrypt internode communication, but when there’s no trust between the parts of the system, algorithms similar to PBFT have to be employed.

Since algorithms resistant to Byzantine faults impose significant overhead in terms of the number of exchanged messages, it is important to understand their use cases. Other protocols, such as the ones described in [BAUDET19] and [BUCHMAN18], attempt to optimize the PBFT algorithm for systems with a large number of participants.

Summary

Consensus algorithms are one of the most interesting yet most complex subjects in distributed systems. Over the last few years, new algorithms and many implementations of the existing algorithms have emerged, which proves the rising importance and popularity of the subject.

In this chapter, we discussed the classic Paxos algorithm, and several variants of Paxos, each one improving its different properties:

Multi-Paxos

Allows a proposer to retain its role and replicate multiple values instead of just one.

Fast Paxos

Allows us to reduce a number of messages by using fast rounds, when acceptors can proceed with messages from proposers other than the established leader.

EPaxos

Establishes event order by resolving dependencies between submitted messages.

Flexible Paxos

Relaxes quorum requirements and only requires a quorum for the first phase (voting) to intersect with a quorum for the second phase (replication).

Raft simplifies the terms in which consensus is described, and makes leadership a first-class citizen in the algorithm. Raft separates log replication, leader election, and safety.

To guarantee consensus safety in adversarial environments, Byzantine fault-tolerant algorithms should be used; for example, PBFT. In PBFT, participants cross-validate one another’s responses and only proceed with execution steps when there’s enough nodes that obey the prescribed algorithm rules.

1 For example, such a situation was described in https://databass.dev/links/68.

Part II Conclusion

Performance and scalability are important properties of any database system. The storage engine and node-local read-write path can have a larger impact on performance of the system: how quickly it can process requests locally. At the same time, a subsystem responsible for communication in the cluster often has a larger impact on the scalability of the database system: maximum cluster size and capacity. However, the storage engine can only be used for a limited number of use cases if it’s not scalable and its performance degrades as the dataset grows. At the same time, putting a slow atomic commit protocol on top of the fastest storage engine will not yield good results.

Distributed, cluster-wide, and node-local processes are interconnected, and have to be considered holistically. When designing a database system, you have to consider how different subsystems fit and work together.

Part II began with a discussion of how distributed systems are different from single-node applications, and which difficulties are to be expected in such environments.

We discussed the basic distributed system building blocks, different consistency models, and several important classes of distributed algorithms, some of which can be used to implement these consistency models:

Failure detection

Identify remote process failures accurately and efficiently.

Leader election

Quickly and reliably choose a single process to temporarily serve as a coordinator.

Dissemination

Reliably distribute information using peer-to-peer communication.

Anti-entropy

Identify and repair state divergence between the nodes.

Distributed transactions

Execute series of operations against multiple partitions atomically.

Consensus

Reach an agreement between remote participants while tolerating process failures.

These algorithms are used in many database systems, message queues, schedulers, and other important infrastructure software. Using the knowledge from this book, you’ll be able to better understand how they work, which, in turn, will help to make better decisions about which software to use, and identify potential problems.

Further Reading

At the end of each chapter, you can find resources related to the material presented in the chapter. Here, you’ll find books you can address for further study, covering both concepts mentioned in this book and other concepts. This list is not meant to be complete, but these sources contain a lot of important and useful information relevant for database systems enthusiasts, some of which is not covered in this book:

Database systems


Bernstein, Philip A., Vassco Hadzilacos, and Nathan Goodman. 1987. Concurrency Control and Recovery in Database Systems. Boston: Addison-Wesley Longman.

Korth, Henry F. and Abraham Silberschatz. 1986. Database System Concepts. New York: McGraw-Hill.

Gray, Jim and Andreas Reuter. 1992. Transaction Processing: Concepts and Techniques (1st Ed.). San Francisco: Morgan Kaufmann.

Stonebraker, Michael and Joseph M. Hellerstein (Eds.). 1998. Readings in Database Systems (3rd Ed.). San Francisco: Morgan Kaufmann.

Weikum, Gerhard and Gottfried Vossen. 2001. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. San Francisco: Morgan Kaufmann.

Ramakrishnan, Raghu and Johannes Gehrke. 2002. Database Management Systems (3 Ed.). New York: McGraw-Hill.

Garcia-Molina, Hector, Jeffrey D. Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book (2 Ed.). Upper Saddle River, NJ: Prentice Hall.

Bernstein, Philip A. and Eric Newcomer. 2009. Principles of Transaction Processing (2nd Ed.). San Francisco: Morgan Kaufmann.

Elmasri, Ramez and Shamkant Navathe. 2010. Fundamentals of Database Systems (6th Ed.). Boston: Addison-Wesley.

Lake, Peter and Paul Crowther. 2013. Concise Guide to Databases: A Practical Introduction. New York: Springer.

Härder, Theo, Caetano Sauer, Goetz Graefe, and Wey Guy. 2015. Instant recovery with write-ahead logging. Datenbank-Spektrum.

Distributed systems


Lynch, Nancy A. Distributed Algorithms. 1996. San Francisco: Morgan Kaufmann.

Attiya, Hagit, and Jennifer Welch. 2004. Distributed Computing: Fundamentals, Simulations and Advanced Topics. Hoboken, NJ: John Wiley & Sons.

Birman, Kenneth P. 2005. Reliable Distributed Systems: Technologies, Web Services, and Applications. Berlin: Springer-Verlag.

Cachin, Christian, Rachid Guerraoui, and Lus Rodrigues. 2011. Introduction to Reliable and Secure Distributed Programming (2nd Ed.). New York: Springer.

Fokkink, Wan. 2013. Distributed Algorithms: An Intuitive Approach. The MIT Press.

Ghosh, Sukumar. Distributed Systems: An Algorithmic Approach (2nd Ed.). Chapman & Hall/CRC.

Tanenbaum Andrew S. and Maarten van Steen. 2017. Distributed Systems: Principles and Paradigms (3rd Ed.). Boston: Pearson.

Operating databases


Beyer, Betsy, Chris Jones, Jennifer Petoff, and Niall Richard Murphy. 2016 Site Reliability Engineering: How Google Runs Production Systems (1st Ed.). Boston: O’Reilly Media.

Blank-Edelman, David N. 2018. Seeking SRE. Boston: O’Reilly Media.

Campbell, Laine and Charity Majors. 2017. Database Reliability Engineering: Designing and Operating Resilient Database Systems (1st Ed.). Boston: O’Reilly Media.

Sridharan, Cindy. 2018. Distributed Systems Observability: A Guide to Building Robust Systems. Boston: O’Reilly Media.

Appendix A. Bibliography

  1. [ABADI12] Abadi, Daniel. 2012. “Consistency Tradeoffs in Modern Distributed Database System Design: CAP is Only Part of the Story.” Computer 45, no. 2 (February): 37-42. https://doi.org/10.1109/MC.2012.33.

  2. [ABADI17] Abadi, Daniel. 2017. “Distributed consistency at scale: Spanner vs. Calvin.” Fauna (blog). April 6, 2017. https://fauna.com/blog/distributed-consistency-at-scale-spanner-vs-calvin.

  3. [ABADI13] Abadi, Daniel, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. 2013. The Design and Implementation of Modern Column-Oriented Database Systems. Hanover, MA: Now Publishers Inc.

  4. [ABRAHAM13] Abraham, Ittai, Danny Dolev, and Joseph Y. Halpern. 2013. “Distributed Protocols for Leader Election: A Game-Theoretic Perspective.” In Distributed Computing, edited by Yehuda Afek, 61-75. Berlin: Springer, Berlin, Heidelberg.

  5. [AGGARWAL88] Aggarwal, Alok, and Jeffrey S. Vitter. 1988. “The input/output complexity of sorting and related problems.” Communications of the ACM 31, no. 9 (September): 1116-1127. https://doi.org/10.1145/48529.48535.

  6. [AGRAWAL09] Agrawal, Devesh, Deepak Ganesan, Ramesh Sitaraman, Yanlei Diao, and Shashi Singh. 2009. “Lazy-Adaptive Tree: an optimized index structure for flash devices.” Proceedings of the VLDB Endowment 2, no. 1 (January): 361-372.

  7. [AGRAWAL08] Agrawal, Nitin, Vijayan Prabhakaran, Ted Wobber, John D. Davis, Mark Manasse, and Rina Panigrahy. 2008. “Design tradeoffs for SSD performance.” USENIX 2008 Annual Technical Conference (ATC ’08), 57-70. USENIX.

  8. [AGUILERA97] Aguilera, Marcos K., Wei Chen, and Sam Toueg. 1997. “Heartbeat: a Timeout-Free Failure Detector for Quiescent Reliable Communication.” In Distributed Algorithms, edited by M. Mavronicolas and P. Tsigas, 126-140. Berlin: Springer, Berlin, Heidelberg.

  9. [AGUILERA01] Aguilera, Marcos Kawazoe, Carole Delporte-Gallet, Hugues Fauconnier, and Sam Toueg. 2001. “Stable Leader Election.” In Proceedings of the 15th International Conference on Distributed Computing (DISC ’01), edited by Jennifer L. Welch, 108-122. London: Springer-Verlag.

  10. [AGUILERA16] Aguilera, M. K., and D. B. Terry. 2016. “The Many Faces of Consistency.” Bulletin of the Technical Committee on Data Engineering 39, no. 1 (March): 3-13.

  11. [ALHOUMAILY10] Al-Houmaily, Yousef J. 2010. “Atomic commit protocols, their integration, and their optimisations in distributed database systems.” International Journal of Intelligent Information and Database Systems 4, no. 4 (September): 373–412. https://doi.org/10.1504/IJIIDS.2010.035582.

  12. [ARJOMANDI83] Arjomandi, Eshrat, Michael J. Fischer, and Nancy A. Lynch. 1983. “Efficiency of Synchronous Versus Asynchronous Distributed Systems.” Journal of the ACM 30, no. 3 (July): 449-456. https://doi.org/10.1145/2402.322387.

  13. [ARULRAJ17] Arulraj, J. and A. Pavlo. 2017. “How to Build a Non-Volatile Memory Database Management System.” In Proceedings of the 2017 ACM International Conference on Management of Data: 1753-1758. https://doi.org/10.1145/3035918.3054780.

  14. [ATHANASSOULIS16] Athanassoulis, Manos, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. 2016. “Designing Access Methods: The RUM Conjecture.” In International Conference on Extending Database Technology (EDBT). https://stratos.seas.harvard.edu/files/stratos/files/rum.pdf.

  15. [ATTIYA94] Attiya, Hagit, and Jennifer L. Welch. 1994. “Sequential consistency versus linearizability.” ACM Transactions on Computer Systems 12, no. 2 (May): 91-122. https://doi.org/10.1145/176575.176576.

  16. [BABAOGLU93] Babaoglu, Ozalp and Sam Toueg. 1993. “Understanding Non-Blocking Atomic Commitment.” Technical Report. University of Bologna.

  17. [BAILIS14a] Bailis, Peter. 2014. “Linearizability versus Serializability.” Highly Available, Seldom Consistent (blog). September 24, 2014. https://www.bailis.org/blog/linearizability-versus-serializability.

  18. [BAILIS14b] Bailis, Peter, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 2014. “Coordination Avoidance in Database Systems.” Proceedings of the VLDB Endowment 8, no. 3 (November): 185-196. https://doi.org/10.14778/2735508.2735509.

  19. [BAILIS14c] Bailis, Peter, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 2014. “Scalable Atomic Visibility with RAMP Transactions.” ACM Transactions on Database Systems 41, no. 3 (July). https://doi.org/10.1145/2909870.

  20. [BARTLETT16] Bartlett, Robert P. III, and Justin McCrary. 2016. “How Rigged Are Stock Markets?: Evidence From Microsecond Timestamps.” UC Berkeley Public Law Research Paper. https://doi.org/10.2139/ssrn.2812123.

  21. [BAUDET19] Baudet, Mathieu, Avery Ching, Andrey Chursin, George Danezis, François Garillot, Zekun Li, Dahlia Malkhi, Oded Naor, Dmitri Perelman, and Alberto Sonnino. 2019. “State Machine Replication in the Libra Blockchain.” https://developers.libra.org/docs/assets/papers/libra-consensus-state-machine-replication-in-the-libra-blockchain.pdf.

  22. [BAYER72] Bayer, R., and E. M. McCreight. 1972. “Organization and maintenance of large ordered indices.” Acta Informatica 1, no. 3 (September): 173-189. https://doi.org/10.1007/BF00288683.

  23. [BEDALY69] Belady, L. A., R. A. Nelson, and G. S. Shedler. 1969. “An anomaly in space-time characteristics of certain programs running in a paging machine.” Communications of the ACM 12, no. 6 (June): 349-353. https://doi.org/10.1145/363011.363155.

  24. [BENDER05] Bender, Michael A., Erik D. Demaine, and Martin Farach-Colton. 2005. “Cache-Oblivious B-Trees.” SIAM Journal on Computing 35, no. 2 (August): 341-358. https://doi.org/10.1137/S0097539701389956.

  25. [BERENSON95] Berenson, Hal, Phil Bernstein, Jim Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. 1995. “A critique of ANSI SQL isolation levels.” ACM SIGMOD Record 24, no. 2 (May): 1-10. https://doi.org/10.1145/568271.223785.

  26. [BERNSTEIN87] Bernstein, Philip A., Vassos Hadzilacos, and Nathan Goodman. 1987. Concurrency Control and Recovery in Database Systems. Boston: Addison-Wesley Longman.

  27. [BERNSTEIN09] Bernstein, Philip A. and Eric Newcomer. 2009. Principles of Transaction Processing. San Francisco: Morgan Kaufmann.

  28. [BHATTACHARJEE17] Bhattacharjee, Abhishek, Daniel Lustig, and Margaret Martonosi. 2017. Architectural and Operating System Support for Virtual Memory. San Rafael, CA: Morgan & Claypool Publishers.

  29. [BIRMAN07] Birman, Ken. 2007. “The promise, and limitations, of gossip protocols.” ACM SIGOPS Operating Systems Review 41, no. 5 (October): 8-13. https://doi.org/10.1145/1317379.1317382.

  30. [BIRMAN10] Birman, Ken. 2010. “A History of the Virtual Synchrony Replication Model” In Replication, edited by Bernadette Charron-Bost, Fernando Pedone, and André Schiper, 91-120. Berlin: Springer-Verlag, Berlin, Heidelberg.

  31. [BIRMAN06] Birman, Ken, Coimbatore Chandersekaran, Danny Dolev, Robbert van Renesse. 2006. “How the Hidden Hand Shapes the Market for Software Reliability.” In First Workshop on Applied Software Reliability (WASR 2006). IEEE.

  32. [BIYIKOGLU13] Biyikoglu, Cihan. 2013. “Under the Hood: Redis CRDTs (Conflict-free Replicated Data Types).” http://lp.redislabs.com/rs/915-NFD-128/images/WP-RedisLabs-Redis-Conflict-free-Replicated-Data-Types.pdf.

  33. [BJØRLING17] Bjørling, Matias, Javier González, and Philippe Bonnet. 2017. “LightNVM: the Linux open-channel SSD subsystem.” In Proceedings of the 15th Usenix Conference on File and Storage Technologies (FAST’17), 359-373. USENIX.

  34. [BLOOM70] Bloom, Burton H. 1970. “Space/time trade-offs in hash coding with allowable errors.” Communications of the ACM 13, no. 7 (July): 422-426. https://doi.org/10.1145/362686.362692.

  35. [BREWER00] Brewer, Eric. 2000. “Towards robust distributed systems.” Proceedings of the nineteenth annual ACM symposium on Principles of distributed computing (PODC ’00). New York: Association for Computing Machinery. https://doi.org/10.1145/343477.343502.

  36. [BRZEZINSKI03] Brzezinski, Jerzy, Cezary Sobaniec, and Dariusz Wawrzyniak. 2003. “Session Guarantees to Achieve PRAM Consistency of Replicated Shared Objects.” In Parallel Processing and Applied Mathematics, 1–8. Berlin: Springer, Berlin, Heidelberg.

  37. [BUCHMAN18] Buchman, Ethan, Jae Kwon, and Zarko Milosevic. 2018. “The latest gossip on BFT consensus.” https://arxiv.org/pdf/1807.04938.pdf.

  38. [CACHIN11] Cachin, Christian, Rachid Guerraoui, and Luis Rodrigues. 2011. Introduction to Reliable and Secure Distributed Programming (2nd Ed.). New York: Springer.

  39. [CASTRO99] Castro, Miguel, and Barbara Liskov. 1999. “Practical Byzantine Fault Tolerance.” In OSDI ’99 Proceedings of the third symposium on Operating systems design and implementation, 173-186.

  40. [CESATI05] Cesati, Marco, and Daniel P. Bovet. 2005. Understanding the Linux Kernel. Third Edition. Sebastopol: O’Reilly Media, Inc.

  41. [CHAMBERLIN81] Chamberlin, Donald D., Morton M. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade, and Robert A. Yost. 1981. “A history and evaluation of System R.” Communications of the ACM 24, no. 10 (October): 632–646. https://doi.org/10.1145/358769.358784.

  42. [CHANDRA07] Chandra, Tushar D., Robert Griesemer, and Joshua Redstone. 2007. “Paxos made live: an engineering perspective.” In Proceedings of the twenty-sixth annual ACM symposium on Principles of distributed computing (PODC ’07), 398-407. New York: Association for Computing Machinery. https://doi.org/10.1145/1281100.1281103.

  43. [CHANDRA96] Chandra, Tushar Deepak, and Sam Toueg. 1996. “Unreliable failure detectors for reliable distributed systems.” Journal of the ACM 43, no. 2 (March): 225-267. https://doi.org/10.1145/226643.226647.

  44. [CHANG79] Chang, Ernest, and Rosemary Roberts. 1979. “An improved algorithm for decentralized extrema-finding in circular configurations of processes.” Communications of the ACM 22, no. 5 (May): 281–283. https://doi.org/10.1145/359104.359108.

  45. [CHANG06] Chang, Fay, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. 2006. “Bigtable: A Distributed Storage System for Structured Data.” In 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06). USENIX.

  46. [CHAZELLE86] Chazelle, Bernard, and Leonidas J. Guibas. 1986. “Fractional Cascading, A Data Structuring Technique.” Algorithmica 1: 133-162. https://doi.org/10.1007/BF01840440.

  47. [CHOCKLER15] Chockler, Gregory, and Dahlia Malkhi. 2015. “Active disk paxos with infinitely many processes.” In Proceedings of the twenty-first annual symposium on Principles of distributed computing (PODC ’02), 78-87. New York: Association for Computing Machinery. https://doi.org/10.1145/571825.571837.

  48. [COMER79] Comer, Douglas. 1979. “The Ubiquitous B-Tree.” ACM Computing Surveys 11, no. 2 (June): 121-137. https://doi.org/10.1145/356770.356776.

  49. [CORBET18] Corbet, Jonathan. 2018. “PostgreSQL’s fsync() surprise.” https://lwn.net/Articles/752063.

  50. [CORBETT12] Corbett, James C., Jeffrey Dean, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford. 2012. “Spanner: Google’s Globally-Distributed Database.” In 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’12), 261-264. USENIX.

  51. [CORMODE04] Cormode, G. and S. Muthukrishnan. 2004. “An improved data stream summary: The count-min sketch and its applications.” Journal of Algorithms 55, No. 1 (April): 58-75. https://doi.org/10.1016/j.jalgor.2003.12.001.

  52. [CORMODE11] Cormode, Graham, and S. Muthukrishnan. 2011. “Approximating Data with the Count-Min Data Structure.” http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf.

  53. [CORMODE12] Cormode, Graham and Senthilmurugan Muthukrishnan. 2012. “Approximating Data with the Count-Min Data Structure.”

  54. [CHRISTIAN91] Cristian, Flaviu. 1991. “Understanding fault-tolerant distributed systems.” Communications of the ACM 34, no. 2 (February): 56-78. https://doi.org/10.1145/102792.102801.

  55. [DAILY13] Daily, John. 2013. “Clocks Are Bad, Or, Welcome to the Wonderful World of Distributed Systems.” Riak (blog). November 12, 2013. https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems.

  56. [DECANDIA07] DeCandia, Giuseppe, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. “Dynamo: amazon’s highly available key-value store.” SIGOPS Operating Systems Review 41, no. 6 (October): 205-220. https://doi.org/10.1145/1323293.1294281.

  57. [DECHEV10] Dechev, Damian, Peter Pirkelbauer, and Bjarne Stroustrup. 2010. “Understanding and Effectively Preventing the ABA Problem in Descriptor-Based Lock-Free Designs.” Proceedings of the 2010 13th IEEE International Symposium on Object/Component/Service-Oriented Real-Time Distributed Computing (ISORC ’10): 185–192. https://doi.org/10.1109/ISORC.2010.10.

  58. [DEMAINE02] Demaine, Erik D. 2002. “Cache-Oblivious Algorithms and Data Structures.” In Lecture Notes from the EEF Summer School on Massive Data Sets. Denmark: University of Aarhus.

  59. [DEMERS87] Demers, Alan, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. 1987. “Epidemic algorithms for replicated database maintenance.” In Proceedings of the sixth annual ACM Symposium on Principles of distributed computing (PODC ’87), 1-12. New York: Association for Computing Machinery. https://doi.org/10.1145/41840.41841.

  60. [DENNING68] Denning, Peter J. 1968. “The working set model for program behavior.” Communications of the ACM 11, no. 5 (May): 323-333. https://doi.org/10.1145/363095.363141.

  61. [DIACONU13] Diaconu, Cristian, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. 2013. “Hekaton: SQL Server’s Memory-Optimized OLTP Engine.” In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD ’13), 1243-1254. New York: Association for Computing Machinery. https://doi.org/10.1145/2463676.2463710.

  62. [DOWNEY12] Downey, Jim. 2012. “Be Careful with Sloppy Quorums.” Jim Downey (blog). March 5, 2012. https://jimdowney.net/2012/03/05/be-careful-with-sloppy-quorums.

  63. [DREPPER07] Drepper, Ulrich. 2007. What Every Programmer Should Know About Memory. Boston: Red Hat, Inc.

  64. [DUNAGAN04] Dunagan, John, Nicholas J. A. Harvey, Michael B. Jones, Dejan Kostić, Marvin Theimer, and Alec Wolman. 2004. “FUSE: lightweight guaranteed distributed failure notification.” In Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation - Volume 6 (OSDI’04), 11-11. USENIX.

  65. [DWORK88] Dwork, Cynthia, Nancy Lynch, and Larry Stockmeyer. 1988. “Consensus in the presence of partial synchrony.” Journal of the ACM 35, no. 2 (April): 288-323. https://doi.org/10.1145/42282.42283.

  66. [EINZIGER15] Einziger, Gil and Roy Friedman. 2015. “A formal analysis of conservative update based approximate counting.” In 2015 International Conference on Computing, Networking and Communications (ICNC), 260-264. IEEE.

  67. [EINZIGER17] Einziger, Gil, Roy Friedman, and Ben Manes. 2017. “TinyLFU: A Highly Efficient Cache Admission Policy.” In 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, 146-153. IEEE.

  68. [ELLIS11] Ellis, Jonathan. 2011. “Understanding Hinted Handoff.” Datastax (blog). May 31, 2011. https://www.datastax.com/dev/blog/understanding-hinted-handoff.

  69. [ELLIS13] Ellis, Jonathan. 2013. “Why Cassandra doesn’t need vector clocks.” Datastax (blog). September 3, 2013. https://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks.

  70. [ELMASRI11] Elmasri, Ramez and Shamkant Navathe. 2011. Fundamentals of Database Systems (6th Ed.). Boston: Pearson.

  71. [FEKETE04] Fekete, Alan, Elizabeth O’Neil, and Patrick O’Neil. 2004. “A read-only transaction anomaly under snapshot isolation.” ACM SIGMOD Record 33, no. 3 (September): 12-14. https://doi.org/10.1145/1031570.1031573.

  72. [FISCHER85] Fischer, Michael J., Nancy A. Lynch, and Michael S. Paterson. 1985. “Impossibility of distributed consensus with one faulty process.” Journal of the ACM 32, 2 (April): 374-382. https://doi.org/10.1145/3149.214121.

  73. [FLAJOLET12] Flajolet, Philippe, Eric Fusy, Olivier Gandouet, and Frédéric Meunier. 2012. “HyperLogLog: The analysis of a near-optimal cardinality estimation algorithm.” In AOFA ’07: Proceedings of the 2007 International Conference on Analysis of Algorithms.

  74. [FOWLER11] Fowler, Martin. 2011. “The LMAX Architecture.” Martin Fowler. July 12, 2011. https://martinfowler.com/articles/lmax.html.

  75. [FOX99] Fox, Armando and Eric A. Brewer. 1999. “Harvest, Yield, and Scalable Tolerant Systems.” In Proceedings of the Seventh Workshop on Hot Topics in Operating Systems, 174-178.

  76. [FREILING11] Freiling, Felix C., Rachid Guerraoui, and Petr Kuznetsov. 2011. “The failure detector abstraction.” ACM Computing Surveys 43, no. 2 (January): Article 9. https://doi.org/10.1145/1883612.1883616.

  77. [MOLINA82] Garcia-Molina, H. 1982. “Elections in a Distributed Computing System.” IEEE Transactions on Computers 31, no. 1 (January): 48-59. https://dx.doi.org/10.1109/TC.1982.1675885.

  78. [MOLINA92] Garcia-Molina, H. and K. Salem. 1992. “Main Memory Database Systems: An Overview.” IEEE Transactions on Knowledge and Data Engineering 4, no. 6 (December): 509-516. https://doi.org/10.1109/69.180602.

  79. [MOLINA08] Garcia-Molina, Hector, Jeffrey D. Ullman, and Jennifer Widom. 2008. Database Systems: The Complete Book (2nd Ed.). Boston: Pearson.

  80. [GEORGOPOULOS16] Georgopoulos, Georgios. 2016. “Memory Consistency Models of Modern CPUs.” https://es.cs.uni-kl.de/publications/datarsg/Geor16.pdf.

  81. [GHOLIPOUR09] Gholipour, Majid, M. S. Kordafshari, Mohsen Jahanshahi, and Amir Masoud Rahmani. 2009. “A New Approach For Election Algorithm in Distributed Systems.” In 2009 Second International Conference on Communication Theory, Reliability, and Quality of Service, 70-74. IEEE. https://doi.org/10.1109/CTRQ.2009.32.

  82. [GIAMPAOLO98] Giampaolo, Dominic. 1998. Practical File System Design with the Be File System. San Francisco: Morgan Kaufmann.

  83. [GILAD17] Gilad, Yossi, Rotem Hemo, Silvio Micali, Georgios Vlachos, and Nickolai Zeldovich. 2017. “Algorand: Scaling Byzantine Agreements for Cryptocurrencies.” Proceedings of the 26th Symposium on Operating Systems Principles (October): 51–68. https://doi.org/10.1145/3132747.3132757.

  84. [GILBERT02] Gilbert, Seth and Nancy Lynch. 2002. “Brewer’s conjecture and the feasibility of consistent, available, partition-tolerant web services.” ACM SIGACT News 33, no. 2 (June): 51-59. https://doi.org/10.1145/564585.564601.

  85. [GILBERT12] Gilbert, Seth and Nancy Lynch. 2012. “Perspectives on the CAP Theorem.” Computer 45, no. 2 (February): 30-36. https://doi.org/10.1109/MC.2011.389.

  86. [GOMES17] Gomes, Victor B. F., Martin Kleppmann, Dominic P. Mulligan, and Alastair R. Beresford. 2017. “Verifying strong eventual consistency in distributed systems.” Proceedings of the ACM on Programming Languages 1 (October). https://doi.org/10.1145/3133933.

  87. [GONÇALVES15] Gonçalves, Ricardo, Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte. 2015. “Concise Server-Wide Causality Management for Eventually Consistent Data Stores.” In Distributed Applications and Interoperable Systems, 66-79. Berlin: Springer.

  88. [GOOSSAERT14] Goossaert, Emmanuel. 2014. “Coding For SSDs.” CodeCapsule (blog). February 12, 2014. http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents.

  89. [GRAEFE04] Graefe, Goetz. 2004. “Write-Optimized B-Trees.” In Proceedings of the Thirtieth international conference on Very large data bases - Volume 30 (VLDB ’04), 672-683. VLDB Endowment.

  90. [GRAEFE07] Graefe, Goetz. 2007. “Hierarchical locking in B-tree indexes.” https://www.semanticscholar.org/paper/Hierarchical-locking-in-B-tree-indexes-Graefe/270669b1eb0d31a99fe99bec67e47e9b11b4553f.

  91. [GRAEFE10] Graefe, Goetz. 2010. “A survey of B-tree locking techniques.” ACM Transactions on Database Systems 35, no. 3 (July). https://doi.org/10.1145/1806907.1806908.

  92. [GRAEFE11] Graefe, Goetz. 2011. “Modern B-Tree Techniques.” Foundations and Trends in Databases 3, no. 4 (April): 203-402. https://doi.org/10.1561/1900000028.

  93. [GRAY05] Gray, Jim, and Catharine van Ingen. 2005. “Empirical Measurements of Disk Failure Rates and Error Rates.” Accessed March 4, 2013. https://arxiv.org/pdf/cs/0701166.pdf.

  94. [GRAY04] Gray, Jim, and Leslie Lamport. 2004. “Consensus on Transaction Commit.” ACM Transactions on Database Systems 31, no. 1 (March): 133-160. https://doi.org/10.1145/1132863.1132867.

  95. [GUERRAOUI07] Guerraoui, Rachid. 2007. “Revisiting the relationship between non-blocking atomic commitment and consensus.” In Distributed Algorithms, 87-100. Berlin: Springer, Berlin, Heidelberg. https://doi.org/10.1007/BFb0022140.

  96. [GUERRAOUI97] Guerraoui, Rachid, and André Schiper. 1997. “Consensus: The Big Misunderstanding.” In Proceedings of the Sixth IEEE Computer Society Workshop on Future Trends of Distributed Computing Systems, 183-188. IEEE.

  97. [GUPTA01] Gupta, Indranil, Tushar D. Chandra, and Germán S. Goldszmidt. 2001. “On scalable and efficient distributed failure detectors.” In Proceedings of the twentieth annual ACM symposium on Principles of distributed computing (PODC ’01). New York: Association for Computing Machinery. https://doi.org/10.1145/383962.384010.

  98. [HADZILACOS05] Hadzilacos, Vassos. 2005. “On the relationship between the atomic commitment and consensus problems.” In Fault-Tolerant Distributed Computing, 201-208. London: Springer-Verlag.

  99. [HAERDER83] Haerder, Theo, and Andreas Reuter. 1983. “Principles of transaction-oriented database recovery.” ACM Computing Surveys 15, no. 4 (December): 287–317. https://doi.org/10.1145/289.291.

  200. [HALE10] Hale, Coda. 2010. “You Can’t Sacrifice Partition Tolerance.” Coda Hale (blog). https://codahale.com/you-cant-sacrifice-partition-tolerance.

  202. [HALPERN90] Halpern, Joseph Y., and Yoram Moses. 1990. “Knowledge and common knowledge in a distributed environment.” Journal of the ACM 37, no. 3 (July): 549-587. https://doi.org/10.1145/79147.79161.

  204. [HARDING17] Harding, Rachael, Dana Van Aken, Andrew Pavlo, and Michael Stonebraker. 2017. “An Evaluation of Distributed Concurrency Control.” Proceedings of the VLDB Endowment 10, no. 5 (January): 553-564. https://doi.org/10.14778/3055540.3055548.

  206. [HAYASHIBARA04] Hayashibara, N., X. Defago, R.Yared, and T. Katayama. 2004. “The Φ Accrual Failure Detector.” In IEEE Symposium on Reliable Distributed Systems, 66-78. https://doi.org/10.1109/RELDIS.2004.1353004.

  208. [HELLAND15] Helland, Pat. 2015. “Immutability Changes Everything.” Queue 13, no. 9 (November). https://doi.org/10.1145/2857274.2884038.

  210. [HELLERSTEIN07] Hellerstein, Joseph M., Michael Stonebraker, and James Hamilton. 2007. “Architecture of a Database System.” Foundations and Trends in Databases 1, no. 2 (February): 141-259. https://doi.org/10.1561/1900000002.

  212. [HERLIHY94] Herlihy, Maurice. 1994. “Wait-Free Synchronization.” ACM Transactions on Programming Languages and Systems 13, no. 1 (January): 124-149. http://dx.doi.org/10.1145/114005.102808.

  214. [HERLIHY10] Herlihy, Maurice, Yossi Lev, Victor Luchangco, and Nir Shavit. 2010. “A Provably Correct Scalable Concurrent Skip List.” https://www.cs.tau.ac.il/~shanir/nir-pubs-web/Papers/OPODIS2006-BA.pdf.

  216. [HERLIHY90] Herlihy, Maurice P., and Jeannette M. Wing. 1990. “Linearizability: a correctness condition for concurrent objects.” ACM Transactions on Programming Languages and Systems 12, no. 3 (July): 463-492. https://doi.org/10.1145/78969.78972.

  218. [HOWARD14] Howard, Heidi. 2014. “ARC: Analysis of Raft Consensus.” Technical Report UCAM-CL-TR-857. Cambridge: University of Cambridge.

  220. [HOWARD16] Howard, Heidi, Dahlia Malkhi, and Alexander Spiegelman. 2016. “Flexible Paxos: Quorum intersection revisited.” https://arxiv.org/abs/1608.06696.

  222. [HOWARD19] Howard, Heidi, and Richard Mortier. 2019. “A Generalised Solution to Distributed Consensus.” https://arxiv.org/abs/1902.06776.

  224. [HUNT10] Hunt, Patrick, Mahadev Konar, Flavio P. Junqueira, and Benjamin Reed. 2010. “ZooKeeper: wait-free coordination for internet-scale systems.” In Proceedings of the 2010 USENIX conference on USENIX annual technical conference (USENIXATC’10), 11. USENIX.

  226. [INTEL14] Intel Corporation. 2014. “Partition Alignment of Intel® SSDs for Achieving Maximum Performance and Endurance.” (February). https://www.intel.com/content/dam/www/public/us/en/documents/technology-briefs/ssd-partition-alignment-tech-brief.pdf.

  228. [JELASITY04] Jelasity, Márk, Rachid Guerraoui, Anne-Marie Kermarrec, and Maarten van Steen. 2004. “The Peer Sampling Service: Experimental Evaluation of Unstructured Gossip-Based Implementations.” In Middleware ’04 Proceedings of the 5th ACM/IFIP/USENIX international conference on Middleware, 79-98. Berlin: Springer-Verlag, Berlin, Heidelberg.

  230. [JELASITY07] Jelasity, Márk, Spyros Voulgaris, Rachid Guerraoui, Anne-Marie Kermarrec, and Maarten van Steen. 2007. “Gossip-based Peer Sampling.” ACM Transactions on Computer Systems 25, no. 3 (August). http://doi.org/10.1145/1275517.1275520.

  232. [JONSON94] Johnson, Theodore, and Dennis Shasha. 1994. “2Q: A Low Overhead High Performance Buffer Management Replacement Algorithm.” In VLDB ’94 Proceedings of the 20th International Conference on Very Large Data Bases, 439-450. San Francisco: Morgan Kaufmann.

  234. [JUNQUEIRA07] Junqueira, Flavio, Yanhua Mao, and Keith Marzullo. 2007. “Classic Paxos vs. fast Paxos: caveat emptor.” In Proceedings of the 3rd workshop on Hot Topics in System Dependability (HotDep’07). USENIX.

  236. [JUNQUEIRA11] Junqueira, Flavio P., Benjamin C. Reed, and Marco Serafini. 2011. “Zab: High-performance broadcast for primary-backup systems.” 2011 IEEE/IFIP 41st International Conference on Dependable Systems & Networks (DSN) (June): 245–256. https://doi.org/10.1109/DSN.2011.5958223.

  238. [KANNAN18] Kannan, Sudarsun, Nitish Bhat, Ada Gavrilovska, Andrea Arpaci-Dusseau, and Remzi Arpaci-Dusseau. 2018. “Redesigning LSMs for Nonvolatile Memory with NoveLSM.” In USENIX ATC ’18 Proceedings of the 2018 USENIX Conference on Usenix Annual Technical Conference, 993-1005. USENIX.

  240. [KARGER97] Karger, D., E. Lehman, T. Leighton, R. Panigrahy, M. Levine, and D. Lewin. 1997. “Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the World Wide Web.” In STOC ’97 Proceedings of the twenty-ninth annual ACM symposium on Theory of computing, 654-663. New York: Association for Computing Machinery.

  242. [KEARNEY17] Kearney, Joe. 2017. “Two Phase Commit an old friend.” Joe’s Mots (blog). January 6, 2017. https://www.joekearney.co.uk/posts/two-phase-commit.

  244. [KEND94] Kendall, Samuel C., Jim Waldo, Ann Wollrath, and Geoff Wyant. 1994. “A Note on Distributed Computing.” Technical Report. Mountain View, CA: Sun Microsystems, Inc.

  246. [KREMARREC07] Kermarrec, Anne-Marie, and Maarten van Steen. 2007. “Gossiping in distributed systems.” SIGOPS Operating Systems Review 41, no. 5 (October): 2-7. https://doi.org/10.1145/1317379.1317381.

  248. [KERRISK10] Kerrisk, Michael. 2010. The Linux Programming Interface. San Francisco: No Starch Press.

  250. [KHANCHANDANI18] Khanchandani, Pankaj, and Roger Wattenhofer. 2018. “Reducing Compare-and-Swap to Consensus Number One Primitives.” https://arxiv.org/abs/1802.03844.

  252. [KIM12] Kim, Jaehong, Sangwon Seo, Dawoon Jung, Jin-Soo Kim, and Jaehyuk Huh. 2012. “Parameter-Aware I/O Management for Solid State Disks (SSDs).” IEEE Transactions on Computers 61, no. 5 (May): 636-649. https://doi.org/10.1109/TC.2011.76.

  254. [KINGSBURY18a] Kingsbury, Kyle. 2018. “Sequential Consistency.” https://jepsen.io/consistency/models/sequential. 2018.

  256. [KINGSBURY18b] Kingsbury, Kyle. 2018. “Strong consistency models.” Aphyr (blog). August 8, 2018. https://aphyr.com/posts/313-strong-consistency-models.

  258. [KLEPPMANN15] Kleppmann, Martin. 2015. “Please stop calling databases CP or AP.” Martin Kleppmann (blog). May 11, 2015. https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html.

  260. [KLEPPMANN14] Kleppmann, Martin, and Alastair R. Beresford. 2014. “A Conflict-Free Replicated JSON Datatype.” https://arxiv.org/abs/1608.03960.

  262. [KNUTH97] Knuth, Donald E. 1997. The Art of Computer Programming, Volume 1 (3rd Ed.): Fundamental Algorithms. Boston: Addison-Wesley Longman.

  264. [KNUTH98] Knuth, Donald E. 1998. The Art of Computer Programming, Volume 3: (2nd Ed.): Sorting and Searching. Boston: Addison-Wesley Longman.

  266. [KOOPMAN15] Koopman, Philip, Kevin R. Driscoll, and Brendan Hall. 2015. “Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity.” U.S. Department of Transportation Federal Aviation Administration. https://www.faa.gov/aircraft/air_cert/design_approvals/air_software/media/TC-14–49.pdf.

  268. [KORDAFSHARI05] Kordafshari, M. S., M. Gholipour, M. Mosakhani, A. T. Haghighat, and M. Dehghan. 2005. “Modified bully election algorithm in distributed systems.” Proceedings of the 9th WSEAS International Conference on Computers (ICCOMP’05), edited by Nikos E. Mastorakis, Article 10. Stevens Point: World Scientific and Engineering Academy and Society.

  270. [KRASKA18] Kraska, Time, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. “The Case for Learned Index Structures.” In SIGMOD ’18 Proceedings of the 2018 International Conference on Management of Data, 489-504. New York: Association for Computing Machinery.

  272. [LAMPORT77] Lamport, Leslie. 1977. “Proving the Correctness of Multiprocess Programs.” IEEE Transactions on Software Engineering 3, no. 2 (March): 125-143. https://doi.org/10.1109/TSE.1977.229904.

  274. [LAMPORT78] Lamport, Leslie. 1978. “Time, Clocks, and the Ordering of Events in a Distributed System.” Communications of the ACM 21, no. 7 (July): 558-565.

  276. [LAMPORT79] Lamport, Leslie. 1979. “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.” IEEE Transactions on Computers 28, no. 9 (September): 690-691. https://doi.org/10.1109/TC.1979.1675439.

  278. [LAMPORT98] Lamport, Leslie. 1998. “The part-time parliament.” ACM Transactions on Computer Systems 16, no. 2 (May): 133-169. https://doi.org/10.1145/279227.279229.

  280. [LAMPORT01] Lamport, Leslie. 2001. “Paxos Made Simple.” ACM SIGACT News (Distributed Computing Column) 32, no. 4 (December): 51-58. https://www.microsoft.com/en-us/research/publication/paxos-made-simple.

  282. [LAMPORT05] Lamport, Leslie. 2005. “Generalized Consensus and Paxos.” https://www.microsoft.com/en-us/research/publication/generalized-consensus-and-paxos.

  284. [LAMPORT06] Lamport, Leslie. 2006. “Fast Paxos.” Distributed Computing 19, no. 2 (July): 79-103. https://doi.org/10.1007/s00446-006-0005-x.

  286. [LAMPORT09] Lamport, Leslie, Dahlia Malkhi, and Lidong Zhou. 2009. “Vertical Paxos and Primary-Backup Replication.” In PODC ’09 Proceedings of the 28th ACM symposium on Principles of distributed computing, 312-313. https://doi.org/10.1145/1582716.1582783.

  288. [LAMPSON01] Lampson, Butler. 2001. “The ABCD’s of Paxos.” In PODC ’01 Proceedings of the twentieth annual ACM symposium on Principles of distributed computing, 13. https://doi.org/10.1145/383962.383969.

  290. [LAMPSON79] Lampson, Butler W., and Howard E. Sturgis. 1979. “Crash Recovery in a Distributed Data Storage System.” https://www.microsoft.com/en-us/research/publication/crash-recovery-in-a-distributed-data-storage-system.

  292. [LARRIVEE15] Larrivee, Steve. 2015. “Solid State Drive Primer.” Cactus Technologies (blog). February 9th, 2015. https://www.cactus-tech.com/resources/blog/details/solid-state-drive-primer-1-the-basic-nand-flash-cell.

  294. [LARSON81] Larson, Per-Åke, and Åbo Akademi. 1981. “Analysis of index-sequential files with overflow chaining.” ACM Transactions on Database Systems 6, no. 4 (December): 671-680. https://doi.org/10.1145/319628.319665.

  296. [LEE15] Lee, Collin, Seo Jin Park, Ankita Kejriwal, Satoshi Matsushita, and John Ousterhout. 2015. “Implementing linearizability at large scale and low latency.” In SOSP ’15 Proceedings of the 25th Symposium on Operating Systems Principles, 71-86. https://doi.org/10.1145/2815400.2815416.

  298. [LEHMAN81] Lehman, Philip L., and S. Bing Yao. 1981. “Efficient locking for concurrent operations on B-trees.” ACM Transactions on Database Systems 6, no. 4 (December): 650-670. https://doi.org/10.1145/319628.319663.

  300. [LEITAO07] Leitao, Joao, Jose Pereira, and Luis Rodrigues. 2007. “Epidemic Broadcast Trees.” In SRDS ’07 Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems, 301-310. IEEE.

  302. [LEVANDOSKI14] Levandoski, Justin J., David B. Lomet, and Sudipta Sengupta. 2013. “The Bw-Tree: A B-tree for new hardware platforms.” In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE ’13), 302-313. IEEE. https://doi.org/10.1109/ICDE.2013.6544834.

  304. [LI10] Li, Yinan, Bingsheng He, Robin Jun Yang, Qiong Luo, and Ke Yi. 2010. “Tree Indexing on Solid State Drives.” Proceedings of the VLDB Endowment 3, no. 1-2 (September): 1195-1206. https://doi.org/10.14778/1920841.1920990.

  306. [LIPTON88] Lipton, Richard J., and Jonathan S. Sandberg. 1988. “PRAM: A scalable shared memory.” Technical Report, Princeton University. https://www.cs.princeton.edu/research/techreps/TR-180-88.

  308. [LLOYD11] Lloyd, W., M. J. Freedman, M. Kaminsky, and D. G. Andersen. 2011. “Don’t settle for eventual: scalable causal consistency for wide-area storage with COPS.” In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (SOSP ’11), 401-416. New York: Association for Computing Machinery. https://doi.org/10.1145/2043556.2043593.

  310. [LLOYD13] Lloyd, W., M. J. Freedman, M. Kaminsky, and D. G. Andersen. 2013. “Stronger semantics for low-latency geo-replicated storage.” In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’13), 313-328. USENIX.

  312. [LU16] Lu, Lanyue, Thanumalayan Sankaranarayana Pillai, Hariharan Gopalakrishnan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. “WiscKey: Separating Keys from Values in SSD-Conscious Storage.” ACM Transactions on Storage (TOS) 13, no. 1 (March): Article 5. https://doi.org/10.1145/3033273.

  314. [MATTERN88] Mattern, Friedemann. 1988. “Virtual Time and Global States of Distributed Systems.” http://courses.csail.mit.edu/6.852/01/papers/VirtTime_GlobState.pdf.

  316. [MCKENNEY05a] McKenney, Paul E. 2005. “Memory Ordering in Modern Microprocessors, Part I.” Linux Journal no. 136 (August): 2.

  318. [MCKENNEY05b] McKenney, Paul E. 2005. “Memory Ordering in Modern Microprocessors, Part II.” Linux Journal no. 137 (September): 5.

  320. [MEHTA17] Mehta, Apurva, and Jason Gustafson. 2017. “Transactions in Apache Kafka.” Confluent (blog). November 17, 2017. https://www.confluent.io/blog/transactions-apache-kafka.

  322. [MELLORCRUMMEY91] Mellor-Crummey, John M., and Michael L. Scott. 1991. “Algorithms for scalable synchronization on shared-memory multiprocessors.” ACM Transactions on Computer Systems 9, no. 1 (February): 21-65. https://doi.org/10.1145/103727.103729.

  324. [MELTON06] Melton, Jim. 2006. “Database Language SQL.” In International Organization for Standardization (ISO), 105–132. Berlin: Springer. https://doi.org/10.1007/b137905.

  326. [MERKLE87] Merkle, Ralph C. 1987. “A Digital Signature Based on a Conventional Encryption Function.” A Conference on the Theory and Applications of Cryptographic Techniques on Advances in Cryptology (CRYPTO ’87), edited by Carl Pomerance. London: Springer-Verlag, 369–378. https://dl.acm.org/citation.cfm?id=704751.

  328. [MILLER78] Miller, R., and L. Snyder. 1978. “Multiple access to B-trees.” Proceedings of the Conference on Information Sciences and Systems, Baltimore: Johns Hopkins University (March).

  330. [MILOSEVIC11] Milosevic, Z., M. Hutle, and A. Schiper. 2011. “On the Reduction of Atomic Broadcast to Consensus with Byzantine Faults.” In Proceedings of the 2011 IEEE 30th International Symposium on Reliable Distributed Systems (SRDS ’11), 235-244. IEEE. https://doi.org/10.1109/SRDS.2011.36.

  332. [MOHAN92] Mohan, C., Don Haderle, Bruce Lindsay, Hamid Pirahesh, and Peter Schwarz. 1992. “ARIES: a transaction recovery method supporting fine-granularity locking and partial rollbacks using write-ahead logging.” Transactions on Database Systems 17, no. 1 (March): 94-162. https://doi.org/10.1145/128765.128770.

  334. [MORARU11] Moraru, Iulian, David G. Andersen, and Michael Kaminsky. 2013. “A Proof of Correctness for Egalitarian Paxos.” https://www.pdl.cmu.edu/PDL-FTP/associated/CMU-PDL-13-111.pdf.

  336. [MORARU13] Moraru, Iulian, David G. Andersen, and Michael Kaminsky. 2013. “There Is More Consensus in Egalitarian Parliaments.” In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (SOSP ’13), 358-372. https://doi.org/10.1145/2517349.2517350.

  338. [MURSHED12] Murshed, Md. Golam, and Alastair R. Allen. 2012. “Enhanced Bully Algorithm for Leader Node Election in Synchronous Distributed Systems.” Computers 1, no. 1: 3-23. https://doi.org/10.3390/computers1010003.

  340. [NICHOLS66] Nichols, Ann Eljenholm. 1966. “The Past Participle of ‘Overflow:’ ‘Overflowed’ or ‘Overflown.’” American Speech 41, no. 1 (February): 52–55. https://doi.org/10.2307/453244.

  342. [NIEVERGELT74] Nievergelt, J. 1974. “Binary search trees and file organization.” In Proceedings of 1972 ACM-SIGFIDET workshop on Data description, access and control (SIGFIDET ’72), 165-187. https://doi.org/10.1145/800295.811490.

  344. [NORVIG01] Norvig, Peter. 2001. “Teach Yourself Programming in Ten Years.” https://norvig.com/21-days.html.

  346. [ONEIL93] O’Neil, Elizabeth J., Patrick E. O’Neil, and Gerhard Weikum. 1993. “The LRU-K page replacement algorithm for database disk buffering.” In Proceedings of the 1993 ACM SIGMOD international conference on Management of data (SIGMOD ’93), 297-306. https://doi.org/10.1145/170035.170081.

  348. [ONEIL96] O’Neil, Patrick, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 1996. “The log-structured merge-tree (LSM-tree).” Acta Informatica 33, no. 4: 351-385. https://doi.org/10.1007/s002360050048.

  350. [ONGARO14] Ongaro, Diego, and John Ousterhout. 2014. “In Search of an Understandable Consensus Algorithm.” In Proceedings of the 2014 USENIX conference on USENIX Annual Technical Conference (USENIX ATC’14), 305-320. USENIX.

  352. [OUYANG14] Ouyang, Jian, Shiding Lin, Song Jiang, Zhenyu Hou, Yong Wang, and Yuanzheng Wang. 2014. “SDF: software-defined flash for web-scale internet storage systems.” ACM SIGARCH Computer Architecture News 42, no. 1 (February): 471-484. https://doi.org/10.1145/2654822.2541959.

  354. [PAPADAKIS93] Papadakis, Thomas. 1993. “Skip lists and probabilistic analysis of algorithms.” Doctoral Dissertation, University of Waterloo. https://cs.uwaterloo.ca/research/tr/1993/28/root2side.pdf.

  356. [PUGH90a] Pugh, William. 1990. “Concurrent Maintenance of Skip Lists.” Technical Report, University of Maryland. https://drum.lib.umd.edu/handle/1903/542.

  358. [PUGH90b] Pugh, William. 1990. “Skip lists: a probabilistic alternative to balanced trees.” Communications of the ACM 33, no. 6 (June): 668-676. https://doi.org/10.1145/78973.78977.

  360. [RAMAKRISHNAN03] Ramakrishnan, Raghu, and Johannes Gehrke. 2002. Database Management Systems (3rd Ed.). New York: McGraw-Hill.

  362. [RAY95] Ray, Gautam, Jayant Haritsa, and S. Seshadri. 1995. “Database Compression: A Performance Enhancement Tool.” In Proceedings of 7th International Conference on Management of Data (COMAD). New York: McGraw-Hill.

  364. [RAYNAL99] Raynal, M., and F. Tronel. 1999. “Group membership failure detection: a simple protocol and its probabilistic analysis.” Distributed Systems Engineering 6, no. 3 (September): 95-102. https://doi.org/10.1088/0967-1846/6/3/301.

  366. [REED78] Reed, D. P. 1978. “Naming and synchronization in a decentralized computer system.” Technical Report, MIT. https://dspace.mit.edu/handle/1721.1/16279.

  368. [REN16] Ren, Kun, Jose M. Faleiro, and Daniel J. Abadi. 2016. “Design Principles for Scaling Multi-core OLTP Under High Contention.” In Proceedings of the 2016 International Conference on Management of Data (SIGMOD ’16), 1583-1598. https://doi.org/10.1145/2882903.2882958.

  370. [ROBINSON08] Robinson, Henry. 2008. “Consensus Protocols: Two-Phase Commit.” The Paper Trail (blog). November 27, 2008. https://www.the-paper-trail.org/post/2008-11-27-consensus-protocols-two-phase-commit.

  372. [ROSENBLUM92] Rosenblum, Mendel, and John K. Ousterhout. 1992. “The Design and Implementation of a Log-Structured File System.” ACM Transactions on Computer Systems 10, no. 1 (February): 26-52. https://doi.org/10.1145/146941.146943.

  374. [ROY12] Roy, Arjun G., Mohammad K. Hossain, Arijit Chatterjee, and William Perrizo. 2012. “Column-oriented Database Systems: A Comparison Study.” In Proceedings of the ISCA 27th International Conference on Computers and Their Applications, 264-269.

  376. [RUSSEL12] Sears, Russell. 2012. “A concurrent skiplist with hazard pointers.” http://rsea.rs/skiplist.

  377. [RYSTSOV16]雷斯特佐夫,丹尼斯。2016.“两全其美:Raft 联合共识 + Single Decree Paxos。” Rystsov.info(博客)。2016 年 1 月 5 日。http ://rystsov.info/2016/01/05/raft-paxos.html

  378. [RYSTSOV16] Rystsov, Denis. 2016. “Best of both worlds: Raft’s joint consensus + Single Decree Paxos.” Rystsov.info (blog). January 5, 2016. http://rystsov.info/2016/01/05/raft-paxos.html.

  379. [RYSTSOV18]丹尼斯·雷斯特佐夫。2018。“没有日志的复制状态机。” https://arxiv.org/abs/1802.07000

  380. [RYSTSOV18] Rystsov, Denis. 2018. “Replicated State Machines without logs.” https://arxiv.org/abs/1802.07000.

  381. [SATZGER07] Satzger、Benjamin、Andreas Pietzowski、Wolfgang Trumler 和 Theo Ungerer。2007 年。“一种用于可靠分布式系统的新型自适应应计故障检测器。” 2007 年 ACM 应用计算研讨会 (SAC '07) 会议记录,551-555。https://doi.org/10.1145/1244002.1244129

  382. [SATZGER07] Satzger, Benjamin, Andreas Pietzowski, Wolfgang Trumler, and Theo Ungerer. 2007. “A new adaptive accrual failure detector for dependable distributed systems.” In Proceedings of the 2007 ACM symposium on Applied computing (SAC ’07), 551-555. https://doi.org/10.1145/1244002.1244129.

  383. [SAVARD05]萨瓦德,约翰。2005 年。“浮点格式。” http://www.quadibloc.com/comp/cp0201.htm

  384. [SAVARD05] Savard, John. 2005. “Floating-Point Formats.” http://www.quadibloc.com/comp/cp0201.htm.

  385. [SCHWARZ86] Schwarz, P.、W. Chang、JC Freytag、G. Lohman、J. McPherson、C. Mohan 和 H. Pirahesh。1986 年。“Starburst 数据库系统的可扩展性。” 在OODS '86 1986 年面向对象数据库系统国际研讨会的会议记录中,85-92。IEEE。

  386. [SCHWARZ86] Schwarz, P., W. Chang, J. C. Freytag, G. Lohman, J. McPherson, C. Mohan, and H. Pirahesh. 1986. “Extensibility in the Starburst database system.” In OODS ’86 Proceedings on the 1986 international workshop on Object-oriented database systems, 85–92. IEEE.

  387. [SEDGEWICK11]罗伯特·塞奇威克和凯文·韦恩。2011.算法(第四版)。波士顿:皮尔逊。

  388. [SEDGEWICK11] Sedgewick, Robert, and Kevin Wayne. 2011. Algorithms (4th Ed.). Boston: Pearson.

  389. [SHAPIRO11a]马克·夏皮罗、努诺·普雷吉萨、卡洛斯·巴克罗和马雷克·扎维尔斯基。2011。“无冲突复制数据类型。” 分布式系统的稳定性、安全性和保障,386-400。柏林:施普林格、柏林、海德堡。

  390. [SHAPIRO11a] Shapiro, Marc, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. 2011. “Conflict-free Replicated Data Types.” In Stabilization, Safety, and Security of Distributed Systems, 386-400. Berlin: Springer, Berlin, Heidelberg.

  391. [SHAPIRO11b]马克·夏皮罗、努诺·普雷吉萨、卡洛斯·巴克罗和马雷克·扎维尔斯基。2011。“收敛和交换复制数据类型的综合研究。” https://hal.inria.fr/inria-00555588/document

  392. [SHAPIRO11b] Shapiro, Marc, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. 2011. “A comprehensive study of Convergent and Commutative Replicated Data Types.” https://hal.inria.fr/inria-00555588/document.

  393. [SHEEHY10a]希伊,贾斯汀。2010.“为什么矢量时钟很难。” 里亚克(博客)。2010 年 4 月 5 日。https ://riak.com/posts/technical/why-vector-clocks-are-hard

  394. [SHEEHY10a] Sheehy, Justin. 2010. “Why Vector Clocks Are Hard.” Riak (blog). April 5, 2010. https://riak.com/posts/technical/why-vector-clocks-are-hard.

  395. [SHEEHY10b] Sheehy、贾斯汀和大卫·史密斯。2010.“Bitcask,用于快速键/值数据的日志结构哈希表。”

  396. [SHEEHY10b] Sheehy, Justin, and David Smith. 2010. “Bitcask, A Log-Structured Hash Table for Fast Key/Value Data.”

  397. [SILBERSCHATZ10] Silberschatz、亚伯拉罕、Henry F. Korth 和 S. Sudarshan。2010。数据库系统概念(第六版)。纽约:麦格劳-希尔。

  398. [SILBERSCHATZ10] Silberschatz, Abraham, Henry F. Korth, and S. Sudarshan. 2010. Database Systems Concepts (6th Ed.). New York: McGraw-Hill.

  399. [SINHA97] Sinha, Pradeep K. 1997。分布式操作系统:概念与设计。新泽西州霍博肯:威利。

  400. [SINHA97] Sinha, Pradeep K. 1997. Distributed Operating Systems: Concepts and Design. Hoboken, NJ: Wiley.

  401. [SKEEN82]斯基恩,戴尔。1982 年。“基于群体的提交协议。” 技术报告,康奈尔大学。

  402. [SKEEN82] Skeen, Dale. 1982. “A Quorum-Based Commit Protocol.” Technical Report, Cornell University.

  403. [SKEEN83] Skeen、Dale 和 M. Stonebraker。1983。“分布式系统中崩溃恢复的正式模型。” IEEE 软件工程汇刊,第 9 期。3(五月):219-228。https://doi.org/10.1109/TSE.1983.236608

  404. [SKEEN83] Skeen, Dale, and M. Stonebraker. 1983. “A Formal Model of Crash Recovery in a Distributed System.” IEEE Transactions on Software Engineering 9, no. 3 (May): 219-228. https://doi.org/10.1109/TSE.1983.236608.

  405. [SOUNDARARARJAN06] Soundararajan,Gokul。2006 年。“在 Apache Derby 进度报告中实现更好的缓存替换算法。” https://pdfs.semanticscholar.org/220b/2fe62f13478f1ec75cf17ad085874689c604.pdf

  406. [SOUNDARARARJAN06] Soundararajan, Gokul. 2006. “Implementing a Better Cache Replacement Algorithm in Apache Derby Progress Report.” https://pdfs.semanticscholar.org/220b/2fe62f13478f1ec75cf17ad085874689c604.pdf.

  407. [STONE98] Stone, J.、M. Greenwald、C. Partridge 和 J. Hughes。1998 年。“校验和和 CRC 在实际数据上的性能。” IEEE/ACM 网络交易6,编号。5(十月):529-543。https://doi.org/10.1109/90.731187

  408. [STONE98] Stone, J., M. Greenwald, C. Partridge and J. Hughes. 1998. “Performance of checksums and CRCs over real data.” IEEE/ACM Transactions on Networking 6, no. 5 (October): 529-543. https://doi.org/10.1109/90.731187.

  409. [TANENBAUM14] Tanenbaum、安德鲁·S. 和赫伯特·博斯​​。2014。现代操作系统(第四版)。上萨德尔河:Prentice Hall Press。

  410. [TANENBAUM14] Tanenbaum, Andrew S., and Herbert Bos. 2014. Modern Operating Systems (4th Ed.). Upper Saddle River: Prentice Hall Press.

  411. [TANENBAUM06] Tanenbaum、Andrew S. 和 Maarten van Steen。2006。分布式系统:原理和范式。波士顿:皮尔逊。

  412. [TANENBAUM06] Tanenbaum, Andrew S., and Maarten van Steen. 2006. Distributed Systems: Principles and Paradigms. Boston: Pearson.

  413. [TARIQ11]塔里克,奥瓦伊斯。2011.“了解 InnoDB 聚集索引。” 奥瓦伊斯·塔里克(博客)。2011 年 1 月 20 日。http ://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/#.XTtaUpNKj5Y

  414. [TARIQ11] Tariq, Ovais. 2011. “Understanding InnoDB clustered indexes.” Ovais Tariq (blog). January 20, 2011. http://www.ovaistariq.net/521/understanding-innodb-clustered-indexes/#.XTtaUpNKj5Y.

  415. [TERRY94] Terry、Douglas B.、Alan J. Demers、Karin Petersen、Mike J. Spreitzer、Marvin M. Theimer 和 Brent B. Welch。1994 年。“弱一致性复制数据的会话保证”。在PDIS '94 第三届并行和分布式信息系统国际会议论文集,140–149。IEEE。

  416. [TERRY94] Terry, Douglas B., Alan J. Demers, Karin Petersen, Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. 1994. “Session Guarantees for Weakly Consistent Replicated Data.” In PDIS ’94 Proceedings of the Third International Conference on Parallel and Distributed Information Systems, 140–149. IEEE.

  417. [THOMAS79] Thomas, Robert H. 1979。“多副本数据库并发控制的多数共识方法。” ACM 数据库系统事务4,编号。2(六月):180-209。https://doi.org/10.1145/320071.320076

  418. [THOMAS79] Thomas, Robert H. 1979. “A majority consensus approach to concurrency control for multiple copy databases.” ACM Transactions on Database Systems 4, no. 2 (June): 180–209. https://doi.org/10.1145/320071.320076.

  419. [THOMSON12] Thomson、Alexander、Thaddeus Diamond、Shu-Chun Weng、Kun Ren、Philip Shao 和 Daniel J. Abadi。2012 年。“Calvin:分区数据库系统的快速分布式事务。” ACM SIGMOD 国际数据管理会议 (SIGMOD '12) 会议记录。纽约:计算机协会。https://doi.org/10.1145/2213836.2213838

  420. [THOMSON12] Thomson, Alexander, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. 2012. “Calvin: Fast distributed transactions for partitioned database systems.” In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD ’12). New York: Association for Computing Machinery. https://doi.org/10.1145/2213836.2213838.

  421. [VANRENESSE98] van Renesse、罗伯特、亚伦·明斯基和马克·海登。1998.“八卦式故障检测服务。” 中间件'98 IFIP 国际会议分布式系统平台和开放分布式处理会议记录,55-70。伦敦:施普林格出版社。

  422. [VANRENESSE98] van Renesse, Robbert, Yaron Minsky, and Mark Hayden. 1998. “A Gossip-Style Failure Detection Service.” In Middleware ’98 Proceedings of the IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing, 55–70. London: Springer-Verlag.

  423. [VENKATARAMAN11] Venkataraman、Shivaram、Niraj Tolia、Parthasarathy Ranganathan 和 Roy H. Campbell。2011.“非易失性字节可寻址存储器的一致且持久的数据结构。” 在第九届 USENIX 文件和存储技术会议 (FAST'11) 的会议记录中,5. USENIX。

  424. [VENKATARAMAN11] Venkataraman, Shivaram, Niraj Tolia, Parthasarathy Ranganathan, and Roy H. Campbell. 2011. “Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory.” In Proceedings of the 9th USENIX conference on File and stroage technologies (FAST’11), 5. USENIX.

  425. [VINOSKI08]维诺斯基,史蒂夫。2008.“方便胜过正确性。” IEEE 互联网计算12,编号。4(八月):89-92。https://doi.org/10.1109/MIC.2008.75

  426. [VINOSKI08] Vinoski, Steve. 2008. “Convenience Over Correctness.” IEEE Internet Computing 12, no. 4 (August): 89–92. https://doi.org/10.1109/MIC.2008.75.

  427. [VIOTTI16]维奥蒂、保罗和马尔科·武科利奇。2016。“非事务性分布式存储系统的一致性。” ACM 计算调查49,编号。1(7 月):第 19 条。https://doi.org/0.1145/2926965

  428. [VIOTTI16] Viotti, Paolo, and Marko Vukolić. 2016. “Consistency in Non-Transactional Distributed Storage Systems.” ACM Computing Surveys 49, no. 1 (July): Article 19. https://doi.org/0.1145/2926965.

  429. [VOGELS09]沃格尔斯,沃纳。2009年。“最终一致。” ACM 52 的通信,编号。1(一月):40-44。https://doi.org/10.1145/1435417.1435432

  430. [VOGELS09] Vogels, Werner. 2009. “Eventually consistent.” Communications of the ACM 52, no. 1 (January): 40–44. https://doi.org/10.1145/1435417.1435432.

  431. [WALDO96]沃尔多、吉姆、杰夫·怀特、安·沃尔拉斯和塞缪尔·C·肯德尔。1996 年。“分布式计算注释”。精选演讲和特邀论文第二届移动对象系统国际研讨会——迈向可编程互联网(7 月):49-64。https://dl.acm.org/itation.cfm?id=747342

  432. [WALDO96] Waldo, Jim, Geoff Wyant, Ann Wollrath, and Samuel C. Kendall. 1996. “A Note on Distributed Computing.” Selected Presentations and Invited Papers SecondInternational Workshop on Mobile Object Systems—Towards the Programmable Internet (July): 49–64. https://dl.acm.org/citation.cfm?id=747342.

  433. [WANG13]王,彭,孙光宇,宋江,欧阳剑,林世鼎,张晨,丛杰森。2014年。“基于LSM树的键值存储在开放通道SSD上的有效设计和实现”。EuroSys '14 第九届欧洲计算机系统会议(4 月)会议记录:第 16 条。https: //doi.org/10.1145/2592798.2592804

  434. [WANG13] Wang, Peng, Guangyu Sun, Song Jiang, Jian Ouyang, Shiding Lin, Chen Zhang, and Jason Cong. 2014. “An Efficient Design and Implementation of LSM-tree based Key-Value Store on Open-Channel SSD.” EuroSys ’14 Proceedings of the Ninth European Conference on Computer Systems (April): Article 16. https://doi.org/10.1145/2592798.2592804.

  435. [WANG18] Wang、Ziqi、Andrew Pavlo、Hyeontaek Lim、Viktor Leis、Huanchen Chang、Michael Kaminsky 和 ​​David G. Andersen。2018.“构建 Bw-Tree 需要的不仅仅是流行词。” 2018 年国际数据管理会议 (SIGMOD '18) 会议记录,473–488。https://doi.org/10.1145/3183713.3196895

  436. [WANG18] Wang, Ziqi, Andrew Pavlo, Hyeontaek Lim, Viktor Leis, Huanchen Zhang, Michael Kaminsky, and David G. Andersen. 2018. “Building a Bw-Tree Takes More Than Just Buzz Words.” Proceedings of the 2018 International Conference on Management of Data (SIGMOD ’18), 473–488. https://doi.org/10.1145/3183713.3196895.

  437. [WEIKUM01] Weikum、格哈德和戈特弗里德·沃森。2001.事务信息系统:并发控制和恢复的理论、算法和实践。旧金山:摩根考夫曼出版公司

  438. [WEIKUM01] Weikum, Gerhard, and Gottfried Vossen. 2001. Transactional Information Systems: Theory, Algorithms, and the Practice of Concurrency Control and Recovery. San Francisco: Morgan Kaufmann Publishers Inc.

  439. [XIA17]夏,费,姜德军,金雄,孙宁辉。2017 年。“HiKV:用于 DRAM-NVM 内存系统的混合索引键值存储。” 2017 年 USENIX 年度技术会议论文集 (USENIX ATC '17),349–362。USENIX。

  440. [XIA17] Xia, Fei, Dejun Jiang, Jin Xiong, and Ninghui Sun. 2017. “HiKV: A Hybrid Index Key-Value Store for DRAM-NVM Memory Systems.” Proceedings of the 2017 USENIX Annual Technical Conference (USENIX ATC ’17), 349–362. USENIX.

  441. [YANG14] Yang,Jingpei,Ned Plasson,Greg Gillis,Nisha Talagala 和 Swaminathan Sundararaman。2014 年。“不要把你的日志堆在我的日志上。” 流入(十月)。https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf

  442. [YANG14] Yang, Jingpei, Ned Plasson, Greg Gillis, Nisha Talagala, and Swaminathan Sundararaman. 2014. “Don’t stack your Log on my Log.” INFLOW (October). https://www.usenix.org/system/files/conference/inflow14/inflow14-yang.pdf.

  443. [ZHAO15]赵文兵. 2015 年。“快速 Paxos 变得简单:理论与实施。” 国际分布式系统与技术杂志6,no。1(一月):15-33。https://doi.org/10.4018/ijdst.2015010102

  444. [ZHAO15] Zhao, Wenbing. 2015. “Fast Paxos Made Easy: Theory and Implementation.” International Journal of Distributed Systems and Technologies 6, no. 1 (January): 15-33. https://doi.org/10.4018/ijdst.2015010102.

Index


About the Author

Alex Petrov is a data infrastructure engineer, database and storage systems enthusiast, Apache Cassandra committer, and PMC member interested in storage, distributed systems, and algorithms.

Colophon

The animal on the cover of Database Internals is the peacock flounder, a name given to both Bothus lunatus and Bothus mancus, inhabitants of the shallow coastal waters of the mid-Atlantic and Indo-Pacific ocean, respectively.

While the blue floral-patterned skin gives the peacock flounder its moniker, these flounders have the ability to change their appearance based on their immediate surroundings. This camouflage ability may be related to the fish’s vision, because it is unable to change its appearance if one of its eyes is covered.

One of this flat fish’s eyes migrates during maturation to join the other on a single side, allowing the fish to look both forward and backward at once. Understandably, rather than swim vertically, the peacock flounder tends to glide an inch or so off the sea floor, closely following the contour of the terrain with its patterned side always facing up.

While the peacock flounder’s current conservation status is designated as of Least Concern, many of the animals on O’Reilly covers are endangered; all of them are important to the world.

The cover illustration is by Karen Montgomery, based on a black and white engraving from Lowry’s The Museum of Natural History. The cover fonts are Gilroy Semibold and Guardian Sans. The text font is Adobe Minion Pro; the heading font is Adobe Myriad Condensed; and the code font is Dalton Maag’s Ubuntu Mono.